Joint Latent Space Model For Social Networks With Multivariate Attributes
Abstract
In many application problems in social, behavioral, and economic sciences, re-
searchers often have data on a social network among a group of individuals along
with high dimensional multivariate measurements for each individual. To analyze
such networked data structures, we propose a joint Attribute and Person Latent
Space Model (APLSM) that summarizes information from the social network and
the multiple attribute measurements in a person-attribute joint latent space. We
develop a Variational Bayesian Expectation-Maximization estimation algorithm to
estimate the posterior distribution of the attribute and person locations in the joint
latent space. This methodology allows for effective integration, informative visualiza-
tion, and prediction of social networks and high dimensional attribute measurements.
Using APLSM, we explore the inner workings of the French financial elites based on
their social networks and their career, political views, and social status. We observe
a division in the social circles of the French elites in accordance with the differences
in their individual characteristics.
1 Introduction
Understanding interactions among sets of entities, often represented as complex networks, is a central research task in many data-intensive scientific fields, including Statistics, Machine learning, Social sciences, Biology, Psychology, and Economics (Watts and Strogatz, 1998; Barabási and Albert, 1999; Albert and Barabási, 2002; Jackson et al., 2008; Girvan and Newman, 2002; Shmulevich et al., 2002; Bickel and Chen, 2009; Bullmore and Sporns, 2009; Rubinov and Sporns, 2010; Carrington et al., 2005; Borgatti et al., 2009; Wasserman et al., 1994; Lazer, 2011). However, a majority of methodological and applied studies have only considered interactions of one type among a set of entities of the same type. More recent studies have pointed to the heterogeneous and multimodal nature of such interactions, whereby a complex networked system is composed of multiple types of interactions among entities that themselves are of multiple types (Kivelä et al., 2014; Boccaletti et al., 2014; Mucha et al., 2010; Paul and Chen, 2016; Paul et al., 2020b; Sengupta and Chen, 2015; Sun et al., 2009; Liu et al., 2014; Ferrara et al., 2014; He et al., 2014; Nickel et al., 2016; Paul et al., 2020a).

∗ This research is partially supported by a grant from the National Science Foundation under Grant No. DMS 1830547. The authors would like to thank Prof. Vishesh Karwa of Temple University, Prof. Srijan Sengupta of North Carolina State University, and Prof. Jessica Logan of The Ohio State University for discussions that helped in conceptualizing the statistical models.
Social relationships are known to affect individual behaviors and outcomes including
dementia (Fratiglioni et al., 2000), decision making (Kim and Srivastava, 2007), adolescent
smoking (Mercken et al., 2010), and online behavior choices (Kwon et al., 2014). At the same
time, individual attributes, such as race, age, and gender, can affect whether and how people
form friendships or romantic partnerships (Dean et al., 2017; McPherson et al., 2001). The
effect of social relationships on individual behaviors is observed through disparities in the
outcomes across different individuals when their friendship structures differ. The reciprocal
effect is also observed through disparities in the friendship structures when individuals’ attributes
differ. Therefore, flexible joint modeling of social relationships and individual behaviors
and attributes is needed to investigate their interrelationships effectively.
A number of popular models for social networks have been extended to incorporate nodal covariates in the literature, including exponential random graph models (ERGMs) (Lusher et al., 2013), stochastic blockmodels (Mele et al., 2019; Sweet, 2015) and latent space models (Fosdick and Hoff, 2015; Krivitsky and Handcock, 2008; Hoff et al., 2002; Austin et al., 2013). In these models, the social network links are treated as dependent variables, and the effects of nodal covariates on the probability of network ties are subsequently estimated. Alternatively, social influence models use the node-level attributes as the dependent variables and estimate the effects of the social network on the attributes (Dorans and Drasgow, 1978; Leenders, 2002; Robins et al., 2001; Sweet and Adhikari, 2020; Frank et al., 2004; Fujimoto et al., 2013; Shalizi and Thomas, 2011; VanderWeele, 2011; VanderWeele and An, 2013; Bramoullé et al., 2009; Goldsmith-Pinkham and Imbens, 2013).
A different approach is to develop a joint modeling framework where different types
of data are integrated by jointly modeling them as the dependent variables. In network
science, the joint modeling framework has been proposed to model multi-view or multiplex
networks, where multiple types of relations are observed on the same set of nodes, using
stochastic block models (Barbillon et al., 2015; Kéfi et al., 2016), and latent space models
(Gollini and Murphy, 2016; Arroyo et al., 2019; Salter-Townshend and McCormick, 2017;
D’Angelo et al., 2018; Zhang et al., 2020). In these models, latent variables are used to
explain the probability of a node being connected with other nodes in multiple types of
relations. When dependencies can be assumed across different layers, the common node
representation is a flexible framework to summarize multiple types of information.
In many cases, high-dimensional multivariate covariates with complex latent structures
are available in addition to a connected network. In particular, often a social network is
observed along with individuals’ attributes or behavioral outcomes. In this case, two types
of relations, the social network relations and various types of individual attributes, are
observed among two types of nodes, the person nodes, and the attribute nodes. To jointly
model such data, a multivariate normal distribution was fitted by Fosdick and Hoff (2015)
to the latent variables from the social network and the observed covariates. This work
is in spirit the closest to our proposed model. However, it restricts the covariates to be
normally distributed, and it does not take into account the multiple latent dimensions of the
covariates. A dynamic version of Fosdick and Hoff (2015) can be found in Guhaniyogi et al.
(2020) with possibilities to accommodate both categorical and continuous attributes. The
most important distinction of our model from this line of work is that we use a second set of
latent variables, the latent attribute variables, in addition to the latent person variables to
summarize the information associated with each attribute. This modeling framework allows for a joint latent space in which two types of nodes, rather than one, interact. Other related works that jointly model heterogeneous networks with stochastic block models include Huang et al. (2020) and Sengupta and Chen (2015).
In this paper, we propose a joint latent space model for heterogeneous and multi-
modal networks with multiple types of relations among multiple types of nodes. The
proposed Attribute Person Latent Space Model (APLSM) merges information from the
social network and the multivariate covariates by assuming that the probabilities of a
node being connected with other same-type and different-type nodes are explained by la-
tent variables. This model has a wide range of applications. For example, in computer science, it is of interest to summarize relational data, e.g., likes and followers in social media, with other user information such as personalities, health outcomes, and online behavior choices. In economics and business, it is of interest to summarize consumer information with their geographic networks and social networks. We demonstrate APLSM with
a data set on the French financial elites (Kadushin, 1995) available to download from
http://moreno.ss.uci.edu/data.html#ffe. To fit the APLSM, we propose a Variational Bayesian Expectation-Maximization (VBEM) algorithm (Blei et al., 2017). As an intermediate step, we also develop a VBEM algorithm for fitting latent space models to bipartite networks. The variational methods enable the models to be fitted to large networks with high dimensional attributes, and our simulations show the accuracy of the methods.
The remainder of this paper is organized as follows. In section 2, we introduce latent
space models for bipartite networks and develop the variational inference approach to
estimate the model. In section 3, we introduce the joint latent space model for the social
network and the multivariate covariates and extend the variational inference method to the
joint model. In section 4, we assess the performance of the estimators with a simulation
study, and in section 5, we apply the proposed methodology to the French financial elite
data. Finally, in section 6, we summarize our findings.
We assume 𝒖 𝒊 to be the latent person variable, with 𝒖 𝒊 ∼ 𝑁(0, 𝜆₀² I𝐷) i.i.d., where 𝛼0, 𝜆₀² are unknown parameters that need to be estimated and I𝐷 is the 𝐷-dimensional identity matrix.
probability of an edge increases as the Euclidean distance between the two nodes decreases.
Here, we use the squared Euclidean distances |𝒖 𝒊 − 𝒖 𝒋 | 2 instead of the Euclidean distances
following Gollini and Murphy (2016), who showed that
squared Euclidean distances are computationally more efficient and that the latent positions
obtained using squared Euclidean distances are extremely similar to those obtained using
Euclidean distances.
Variations of the latent space model are further developed in Hoff (2005, 2008, 2009,
2018) and Ma et al. (2020). Extensions of the latent distance model include Handcock et al. (2007)’s latent position cluster model, which allows for the clustering of nodes based on the Euclidean distances. By replacing the single Gaussian distribution with a mixture of Gaussians, Handcock et al. (2007) were able to account for possible latent community structure in the networks. Additional random sender and receiver effects were added to the latent position cluster model by Krivitsky et al. (2009). The latent space model has been further extended to accommodate multiple networks (Gollini and Murphy, 2016; Salter-Townshend and McCormick, 2017), dynamic networks (Sewell and Chen, 2015), and bipartite networks (Friel et al., 2016). The majority of the works on latent space models described above have utilized Bayesian estimation techniques, including Markov Chain Monte Carlo (MCMC) and Variational Inference (VI). Recently, Ma et al. (2020) proposed two algorithms based on nuclear norm penalization and projected gradient descent to fit the latent space model with statistical consistency guarantees.
Let 𝒀𝑰 𝑨 be an 𝑁 × 𝑀 matrix whose entry 𝑌𝑖𝑎𝐼 𝐴 indicates whether person 𝑖 has attribute 𝑎, for 𝑖 ∈ {1, 2, . . . , 𝑁} and 𝑎 ∈ {1, 2, . . . , 𝑀}. Let 𝑽 be an 𝑀 × 𝐷 matrix of latent attribute positions, each row of which is a 𝐷-dimensional vector 𝒗 𝒂 = (𝑣 𝑎1 , 𝑣 𝑎2 , . . . , 𝑣 𝑎𝐷 ) indicating the latent position of attribute 𝑎 in the Euclidean space.
The latent distance model for the bipartite network 𝒀𝑰 𝑨 can be written as:

$$Y^{IA}_{ia} \mid (\boldsymbol{U}, \boldsymbol{V}, \alpha_1) \sim \mathrm{Bernoulli}(g(\phi_{ia})), \qquad g(\phi_{ia}) = \frac{\exp(\alpha_1 - |\boldsymbol{u}_i - \boldsymbol{v}_a|^2)}{1 + \exp(\alpha_1 - |\boldsymbol{u}_i - \boldsymbol{v}_a|^2)}.$$
The parameter 𝛼1 accounts for the density of the bipartite network. The probability of a
positive response increases as the Euclidean distance between the attribute node and the
person node decreases.
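To make the data-generating mechanism concrete, the following sketch simulates a bipartite person-attribute matrix from the latent distance model above. The values of N, M, D, alpha1 and the prior scales are illustrative choices of ours, not quantities taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, D = 50, 13, 2               # persons, attributes, latent dimensions (illustrative)
alpha1, lam0, lam1 = 1.0, 1.0, 1.0

U = rng.normal(0.0, lam0, size=(N, D))   # latent person positions u_i
V = rng.normal(0.0, lam1, size=(M, D))   # latent attribute positions v_a

# squared Euclidean distances |u_i - v_a|^2 for every person-attribute pair
sq_dist = ((U[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)

# logistic link: P(Y_ia = 1) = exp(alpha1 - d^2) / (1 + exp(alpha1 - d^2))
prob_IA = 1.0 / (1.0 + np.exp(-(alpha1 - sq_dist)))

Y_IA = rng.binomial(1, prob_IA)          # simulated binary person-by-attribute matrix
```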
where ũ𝑖 , Λ̃0 , ṽ𝑎 , Λ̃1 are the parameters of the variational distribution, known as variational
parameters.
We can estimate the variational parameters by minimizing the Kullback-Leibler (KL) divergence between the variational posterior 𝑞(𝑼, 𝑽 |𝒀𝑰 𝑨 ) and the true posterior 𝑓 (𝑼, 𝑽 |𝒀𝑰 𝑨 ). Minimizing the KL divergence is equivalent to maximizing the following Evidence Lower Bound (ELBO) function (Blei et al., 2017) (see detailed derivations in the Supplementary Materials):
$$
\begin{aligned}
\mathrm{ELBO} &= -\mathbb{E}_{q(\boldsymbol{U},\boldsymbol{V},\alpha_1\mid\boldsymbol{Y}_{IA})}\left[\log\frac{q(\boldsymbol{U},\boldsymbol{V},\alpha_1\mid\boldsymbol{Y}_{IA})}{p(\boldsymbol{U},\boldsymbol{V},\boldsymbol{Y}_{IA}\mid\alpha_1)}\right]\\
&= -\int q(\boldsymbol{U},\boldsymbol{V},\alpha_1\mid\boldsymbol{Y}_{IA})\,\log\frac{q(\boldsymbol{U},\boldsymbol{V},\alpha_1\mid\boldsymbol{Y}_{IA})}{f(\boldsymbol{U},\boldsymbol{V},\alpha_1\mid\boldsymbol{Y}_{IA})}\;d(\boldsymbol{U},\boldsymbol{V},\alpha_1)\\
&= -\int \prod_{i=1}^{N} q(\boldsymbol{u}_i)\prod_{a=1}^{M} q(\boldsymbol{v}_a)\,\log\frac{\prod_{i=1}^{N} q(\boldsymbol{u}_i)\prod_{a=1}^{M} q(\boldsymbol{v}_a)}{f(\boldsymbol{Y}_{IA}\mid\boldsymbol{U},\boldsymbol{V},\alpha_1)\prod_{i=1}^{N} f(\boldsymbol{u}_i)\prod_{a=1}^{M} f(\boldsymbol{v}_a)}\;d(\boldsymbol{U},\boldsymbol{V},\alpha_1)\\
&= -\sum_{i=1}^{N}\int q(\boldsymbol{u}_i)\log\frac{q(\boldsymbol{u}_i)}{f(\boldsymbol{u}_i)}\,d\boldsymbol{u}_i - \sum_{a=1}^{M}\int q(\boldsymbol{v}_a)\log\frac{q(\boldsymbol{v}_a)}{f(\boldsymbol{v}_a)}\,d\boldsymbol{v}_a\\
&\qquad + \int q(\boldsymbol{U},\boldsymbol{V},\alpha_1\mid\boldsymbol{Y}_{IA})\,\log f(\boldsymbol{Y}_{IA}\mid\boldsymbol{U},\boldsymbol{V},\alpha_1)\;d(\boldsymbol{U},\boldsymbol{V},\alpha_1)\\
&= -\sum_{i=1}^{N}\mathrm{KL}\!\left[q(\boldsymbol{u}_i)\,\|\,f(\boldsymbol{u}_i)\right] - \sum_{a=1}^{M}\mathrm{KL}\!\left[q(\boldsymbol{v}_a)\,\|\,f(\boldsymbol{v}_a)\right] + \mathbb{E}_{q(\boldsymbol{U},\boldsymbol{V},\alpha_1\mid\boldsymbol{Y}_{IA})}\!\left[\log f(\boldsymbol{Y}_{IA}\mid\boldsymbol{U},\boldsymbol{V},\alpha_1)\right]\\
&= -\frac{DN}{2}\log(\lambda_0^2) - \frac{N}{2}\log(\det\tilde{\Lambda}_0) - \frac{N\,\mathrm{tr}(\tilde{\Lambda}_0)}{2\lambda_0^2} - \frac{\sum_{i=1}^{N}\tilde{\boldsymbol{u}}_i^{T}\tilde{\boldsymbol{u}}_i}{2\lambda_0^2}\\
&\qquad - \frac{DM}{2}\log(\lambda_1^2) - \frac{M}{2}\log(\det\tilde{\Lambda}_1) - \frac{M\,\mathrm{tr}(\tilde{\Lambda}_1)}{2\lambda_1^2} - \frac{\sum_{a=1}^{M}\tilde{\boldsymbol{v}}_a^{T}\tilde{\boldsymbol{v}}_a}{2\lambda_1^2} + \frac{1}{2}(MD+ND)\\
&\qquad + \mathbb{E}_{q(\boldsymbol{U},\boldsymbol{V}\mid\boldsymbol{Y}_{IA})}\!\left[\log f(\boldsymbol{Y}_{IA}\mid\boldsymbol{U},\boldsymbol{V})\right]. \qquad (1)
\end{aligned}
$$
After applying Jensen’s inequality (Jensen, 1906), a lower-bound on E𝑞(𝑼,𝑽 |𝒀𝑰 𝑨 ) [log 𝑓 (𝒀𝑰 𝑨 |𝑼, 𝑽)]
is given by,
The VBEM algorithm replaces the E step, in which we would compute the expectation of the complete likelihood with respect to the conditional distribution 𝑓 (𝑼, 𝑽 |𝒀𝑰 𝑨 ), with a VE step, in which we compute the expectation with respect to the best variational distribution (obtained by optimizing the ELBO function) at that iteration.
The detailed procedures are as follows. We start with the initial parameters Θ(0) = α̃1(0) and ũ𝑖(0), Λ̃0(0), ṽ𝑎(0), Λ̃1(0), and then we iterate the following VE (variational expectation) and M (maximization) steps. During the VE step, we maximize ELBO(𝑞(U), 𝑞(V), Θ) with respect to the variational parameters ũ𝑖, ṽ𝑎, Λ̃0 and Λ̃1 given the other model parameters and obtain ELBO(𝑞∗(U), 𝑞∗(V), Θ). During the M step, we fix ũ𝑖, ṽ𝑎, Λ̃0 and Λ̃1 and maximize ELBO(𝑞(U), 𝑞(V), Θ) with respect to α̃1. To do this, we differentiate ELBO(𝑞(U), 𝑞(V), Θ) with respect to each variational parameter. We obtain closed form update rules by setting the partial derivatives to zero while introducing first- and second-order Taylor series approximations of the log functions in ELBO(𝑞(U), 𝑞(V), Θ) (see detailed derivations in the supplementary material). Taylor series expansions are commonly used in variational approaches. For example, three first-order Taylor expansions were used by Salter-Townshend and Murphy (2013) to simplify the Euclidean distance in the latent position cluster model, and first- and second-order Taylor expansions were used by Gollini and Murphy (2016) to simplify the squared Euclidean distance in the LSM. Following these previous publications, we approximate the three log functions in our ELBO(𝑞(U), 𝑞(V), Θ) function to find closed form update rules for the variational parameters. Define the function
$$\boldsymbol{F}_{IA} = \sum_{i=1}^{N}\sum_{a=1}^{M}\log\left(1 + \frac{\exp(\tilde{\alpha}_1)}{\det(\boldsymbol{I} + 2\tilde{\Lambda}_0 + 2\tilde{\Lambda}_1)^{1/2}}\,\exp\!\left(-(\tilde{\boldsymbol{u}}_i - \tilde{\boldsymbol{v}}_a)^{T}(\boldsymbol{I} + 2\tilde{\Lambda}_0 + 2\tilde{\Lambda}_1)^{-1}(\tilde{\boldsymbol{u}}_i - \tilde{\boldsymbol{v}}_a)\right)\right).$$
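As a numerical sanity check, the function 𝑭𝑰𝑨 can be evaluated directly from the variational parameters. The sketch below is our own helper, assuming generic D×D covariance matrices Λ̃0 and Λ̃1; it is not the authors' implementation.

```python
import numpy as np

def F_IA(U_tilde, V_tilde, alpha1_tilde, Lambda0_tilde, Lambda1_tilde):
    """Evaluate the double-sum term F_IA at the current variational parameters."""
    N, D = U_tilde.shape
    M = V_tilde.shape[0]
    A = np.eye(D) + 2.0 * Lambda0_tilde + 2.0 * Lambda1_tilde
    A_inv = np.linalg.inv(A)
    scale = np.exp(alpha1_tilde) / np.sqrt(np.linalg.det(A))
    total = 0.0
    for i in range(N):
        for a in range(M):
            diff = U_tilde[i] - V_tilde[a]
            total += np.log(1.0 + scale * np.exp(-diff @ A_inv @ diff))
    return total
```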
The closed form update rules of the (𝑡 + 1)th iteration are as follows.

VE-step: Estimate ũ𝑖(𝑡+1), ṽ𝑎(𝑡+1), Λ̃0(𝑡+1) and Λ̃1(𝑡+1) by maximizing ELBO(𝑞(U), 𝑞(V), Θ):

$$\tilde{\boldsymbol{u}}_i^{(t+1)} = \left[\left(\frac{1}{2\lambda_0^2} + \sum_{a=1}^{M} y_{ia}\right)\boldsymbol{I} + \frac{1}{2}\boldsymbol{H}_{IA}(\tilde{\boldsymbol{u}}_i^{(t)})\right]^{-1}\left[\sum_{a=1}^{M} y_{ia}\,\tilde{\boldsymbol{v}}_a^{(t)} + \frac{1}{2}\boldsymbol{H}_{IA}(\tilde{\boldsymbol{u}}_i^{(t)})\,\tilde{\boldsymbol{u}}_i^{(t)} - \frac{1}{2}\boldsymbol{G}_{IA}(\tilde{\boldsymbol{u}}_i^{(t)})\right]$$

$$\tilde{\boldsymbol{v}}_a^{(t+1)} = \left[\left(\frac{1}{2\lambda_1^2} + \sum_{i=1}^{N} y_{ia}\right)\boldsymbol{I} - \frac{1}{2}\boldsymbol{H}_{IA}(\tilde{\boldsymbol{v}}_a^{(t)})\right]^{-1}\left[\sum_{i=1}^{N} y_{ia}\,\tilde{\boldsymbol{u}}_i^{(t)} - \frac{1}{2}\boldsymbol{G}_{IA}(\tilde{\boldsymbol{v}}_a^{(t)})\right]$$

$$\tilde{\Lambda}_0^{(t+1)} = \left[\left(\frac{N}{2} + \frac{N}{2\lambda_0^2} + \sum_{i=1}^{N}\sum_{a=1}^{M} y_{ia}\right)\boldsymbol{I} + \boldsymbol{G}_{IA}(\tilde{\Lambda}_0^{(t)})\right]^{-1}$$

$$\tilde{\Lambda}_1^{(t+1)} = \left[\left(\frac{M}{2} + \frac{M}{2\lambda_1^2} + \sum_{i=1}^{N}\sum_{a=1}^{M} y_{ia}\right)\boldsymbol{I} + \boldsymbol{G}_{IA}(\tilde{\Lambda}_1^{(t)})\right]^{-1}, \qquad (2)$$
where 𝑮𝑰𝑨(ũ𝑖(𝑡)) and 𝑮𝑰𝑨(ṽ𝑎(𝑡)) are the partial derivatives (gradients) of 𝑭𝑰𝑨 with respect to ũ𝑖 and ṽ𝑎, evaluated at ũ𝑖(𝑡) and ṽ𝑎(𝑡), respectively. In 𝑮𝑰𝑨(ũ𝑖(𝑡)), the subscript 𝑰𝑨 indicates that the gradient is of the function 𝑭𝑰𝑨, and the subscript 𝑖 in ũ𝑖(𝑡) indicates that the gradient is with respect to ũ𝑖, evaluated at ũ𝑖(𝑡). Similarly, 𝑯𝑰𝑨(ũ𝑖(𝑡)) and 𝑯𝑰𝑨(ṽ𝑎(𝑡)) are the second-order partial derivatives of 𝑭𝑰𝑨 with respect to ũ𝑖 and ṽ𝑎, evaluated at ũ𝑖(𝑡) and ṽ𝑎(𝑡), respectively.
M-step: Estimate α̃1(𝑡+1) with the following update rule,

$$\tilde{\alpha}_1^{(t+1)} = \frac{\sum_{i=1}^{N}\sum_{a=1}^{M} y^{IA}_{ia} - g_{IA}(\tilde{\alpha}_1^{(t)}) + \tilde{\alpha}_1^{(t)}\,h_{IA}(\tilde{\alpha}_1^{(t)})}{h_{IA}(\tilde{\alpha}_1^{(t)})}, \qquad (3)$$
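Equation 3 is a single Newton-type step for α̃1: the current value is moved by the ratio of the gradient to the curvature of the bound. A minimal sketch, with g_IA and h_IA passed in as callables (our own naming, not the authors' code):

```python
def update_alpha1(alpha1_t, Y_IA, g_IA, h_IA):
    """One M-step update for alpha1 following Equation 3.

    g_IA(alpha) and h_IA(alpha) are assumed to return the first and second
    derivatives of F_IA with respect to alpha, evaluated at alpha.
    """
    h = h_IA(alpha1_t)
    return (Y_IA.sum() - g_IA(alpha1_t) + alpha1_t * h) / h
```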
where 𝑔𝑖 (·) are the link functions, and 𝑝𝑖 (·|·) are the parametric families of distributions suitable for the type of data in the matrices. We set the priors 𝒖 𝒊 ∼ 𝑁(0, 𝜆₀² I𝐷) and 𝒗 𝒂 ∼ 𝑁(0, 𝜆₁² I𝐷), independently across nodes. 𝛼0, 𝛼1, 𝜆₀², 𝜆₁² are unknown parameters. Different levels of 𝛼0 and 𝛼1 accommodate the different overall densities of 𝒀𝑰 and 𝒀𝑰 𝑨 . We take
$$\theta^{I}_{i,j} = \alpha_0 - |\boldsymbol{u}_i - \boldsymbol{u}_j|^2, \qquad\text{and}\qquad \theta^{IA}_{i,a} = \alpha_1 - |\boldsymbol{u}_i - \boldsymbol{v}_a|^2, \qquad (4)$$

In Equation 4, squared Euclidean distance forms are assumed for $\theta^{I}_{i,j}$ and $\theta^{IA}_{i,a}$. If the data in 𝒀𝑰 and 𝒀𝑰 𝑨 are binary, then the link functions $g_1(\theta^{I}_{i,j})$ and $g_2(\theta^{IA}_{i,a})$ are logistic functions.
For count data, a Poisson-based model for the distribution of $y^{IA}_{i,a} \mid \theta^{IA}_{i,a}$ can be written as follows:

$$p_3(y^{IA}_{i,a} \mid \theta^{IA}_{i,a}) = \left(1 - \kappa(\theta^{IA}_{i,a})\right)^{\mathbb{1}(y^{IA}_{i,a}=0)} \times \kappa(\theta^{IA}_{i,a})\,\frac{\exp\!\left(-\gamma(\theta^{IA}_{i,a})\right)\gamma(\theta^{IA}_{i,a})^{\,y^{IA}_{i,a}}}{y^{IA}_{i,a}!}$$

$$\kappa(\theta^{IA}_{i,a}) = \frac{\exp(\theta^{IA}_{i,a})}{1 + \exp(\theta^{IA}_{i,a})}, \qquad \gamma(\theta^{IA}_{i,a}) = \exp(\theta^{IA}_{i,a}).$$
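A small helper, written by us to mirror the formulas as reconstructed above, evaluates this likelihood for a single count; it is only a sketch of how the count model would be used, not the authors' code.

```python
from math import factorial
import numpy as np

def p3(y, theta):
    """Likelihood of a count y under the Poisson-based model above."""
    kappa = np.exp(theta) / (1.0 + np.exp(theta))   # logistic component kappa(theta)
    gamma = np.exp(theta)                            # Poisson rate gamma(theta)
    poisson_part = np.exp(-gamma) * gamma ** y / factorial(y)
    return (1.0 - kappa) ** (y == 0) * kappa * poisson_part
```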
where ũ𝑖 , Λ̃0 , ṽ𝑎 , Λ̃1 are the parameters of the distribution, known as variational parameters.
The Evidence Lower Bound (ELBO) function for APLSM is (see detailed derivations in the Supplementary Materials)

$$
\begin{aligned}
\mathrm{ELBO} &= -\mathbb{E}_{q(\boldsymbol{U},\boldsymbol{V},\alpha_0,\alpha_1\mid\boldsymbol{Y}_{I},\boldsymbol{Y}_{IA})}\left[\log\frac{q(\boldsymbol{U},\boldsymbol{V},\alpha_0,\alpha_1\mid\boldsymbol{Y}_{I},\boldsymbol{Y}_{IA})}{p(\boldsymbol{U},\boldsymbol{V},\boldsymbol{Y}_{I},\boldsymbol{Y}_{IA}\mid\alpha_0,\alpha_1)}\right]\\
&= -\int q(\boldsymbol{U},\boldsymbol{V},\alpha_0,\alpha_1\mid\boldsymbol{Y}_{I},\boldsymbol{Y}_{IA})\,\log\frac{q(\boldsymbol{U},\boldsymbol{V},\alpha_0,\alpha_1\mid\boldsymbol{Y}_{I},\boldsymbol{Y}_{IA})}{f(\boldsymbol{U},\boldsymbol{V},\alpha_0,\alpha_1\mid\boldsymbol{Y}_{I},\boldsymbol{Y}_{IA})}\;d(\boldsymbol{U},\boldsymbol{V},\alpha_0,\alpha_1)\\
&= -\int \prod_{i=1}^{N} q(\boldsymbol{u}_i)\prod_{a=1}^{M} q(\boldsymbol{v}_a)\,\log\frac{\prod_{i=1}^{N} q(\boldsymbol{u}_i)\prod_{a=1}^{M} q(\boldsymbol{v}_a)}{f(\boldsymbol{Y}_{I},\boldsymbol{Y}_{IA}\mid\boldsymbol{U},\boldsymbol{V},\alpha_0,\alpha_1)\prod_{i=1}^{N} f(\boldsymbol{u}_i)\prod_{a=1}^{M} f(\boldsymbol{v}_a)}\;d(\boldsymbol{U},\boldsymbol{V},\alpha_0,\alpha_1)\\
&= -\sum_{i=1}^{N}\int q(\boldsymbol{u}_i)\log\frac{q(\boldsymbol{u}_i)}{f(\boldsymbol{u}_i)}\,d\boldsymbol{u}_i - \sum_{a=1}^{M}\int q(\boldsymbol{v}_a)\log\frac{q(\boldsymbol{v}_a)}{f(\boldsymbol{v}_a)}\,d\boldsymbol{v}_a\\
&\qquad + \int q(\boldsymbol{U},\boldsymbol{V},\alpha_0,\alpha_1\mid\boldsymbol{Y}_{I},\boldsymbol{Y}_{IA})\,\log f(\boldsymbol{Y}_{I},\boldsymbol{Y}_{IA}\mid\boldsymbol{U},\boldsymbol{V},\alpha_0,\alpha_1)\;d(\boldsymbol{U},\boldsymbol{V},\alpha_0,\alpha_1)\\
&= -\sum_{i=1}^{N}\mathrm{KL}\!\left[q(\boldsymbol{u}_i)\,\|\,f(\boldsymbol{u}_i)\right] - \sum_{a=1}^{M}\mathrm{KL}\!\left[q(\boldsymbol{v}_a)\,\|\,f(\boldsymbol{v}_a)\right] + \mathbb{E}_{q(\boldsymbol{U},\boldsymbol{V},\alpha_0,\alpha_1\mid\boldsymbol{Y}_{I},\boldsymbol{Y}_{IA})}\!\left[\log f(\boldsymbol{Y}_{I},\boldsymbol{Y}_{IA}\mid\boldsymbol{U},\boldsymbol{V},\alpha_0,\alpha_1)\right]\\
&= -\frac{DN}{2}\log(\lambda_0^2) - \frac{N}{2}\log(\det\tilde{\Lambda}_0) - \frac{N\,\mathrm{tr}(\tilde{\Lambda}_0)}{2\lambda_0^2} - \frac{\sum_{i=1}^{N}\tilde{\boldsymbol{u}}_i^{T}\tilde{\boldsymbol{u}}_i}{2\lambda_0^2}\\
&\qquad - \frac{DM}{2}\log(\lambda_1^2) - \frac{M}{2}\log(\det\tilde{\Lambda}_1) - \frac{M\,\mathrm{tr}(\tilde{\Lambda}_1)}{2\lambda_1^2} - \frac{\sum_{a=1}^{M}\tilde{\boldsymbol{v}}_a^{T}\tilde{\boldsymbol{v}}_a}{2\lambda_1^2} + \frac{1}{2}(MD+ND)\\
&\qquad + \mathbb{E}_{q(\boldsymbol{U},\boldsymbol{V}\mid\boldsymbol{Y}_{I},\boldsymbol{Y}_{IA})}\!\left[\log f(\boldsymbol{Y}_{I},\boldsymbol{Y}_{IA}\mid\boldsymbol{U},\boldsymbol{V})\right]. \qquad (5)
\end{aligned}
$$
After applying Jensen’s inequality (Jensen, 1906), a lower-bound on the third term is given by,

$$-\sum_{i=1}^{N}\sum_{j=1, j\neq i}^{N}\log\left(1 + \frac{\exp(\tilde{\alpha}_0)}{\det(\boldsymbol{I} + 4\tilde{\Lambda}_0)^{1/2}}\,\exp\!\left(-(\tilde{\boldsymbol{u}}_i - \tilde{\boldsymbol{u}}_j)^{T}(\boldsymbol{I} + 4\tilde{\Lambda}_0)^{-1}(\tilde{\boldsymbol{u}}_i - \tilde{\boldsymbol{u}}_j)\right)\right). \qquad (6)$$
The closed form update rules of the (𝑡 + 1)th iteration are as follows.

VE-step: Estimate ũ𝑖(𝑡+1), ṽ𝑎(𝑡+1), Λ̃0(𝑡+1) and Λ̃1(𝑡+1) by maximizing ELBO(𝑞(U), 𝑞(V), Θ):

$$\tilde{\boldsymbol{u}}_i^{(t+1)} = \left[\left(\frac{1}{2\lambda_0^2} + \sum_{j=1, j\neq i}^{N}(y_{ij} + y_{ji}) + \sum_{a=1}^{M} y_{ia}\right)\boldsymbol{I} + \frac{1}{2}\left(\boldsymbol{H}_{I}(\tilde{\boldsymbol{u}}_i^{(t)}) + \boldsymbol{H}_{IA}(\tilde{\boldsymbol{u}}_i^{(t)})\right)\right]^{-1} \qquad (7)$$

$$\times\left[\sum_{j=1, j\neq i}^{N}(y_{ij} + y_{ji})\,\tilde{\boldsymbol{u}}_j + \sum_{a=1}^{M} y_{ia}\,\tilde{\boldsymbol{v}}_a^{(t)} - \frac{1}{2}\boldsymbol{G}_{I}(\tilde{\boldsymbol{u}}_i^{(t)}) + \frac{1}{2}\left(\boldsymbol{H}_{I}(\tilde{\boldsymbol{u}}_i^{(t)}) + \boldsymbol{H}_{IA}(\tilde{\boldsymbol{u}}_i^{(t)})\right)\tilde{\boldsymbol{u}}_i^{(t)} - \frac{1}{2}\boldsymbol{G}_{IA}(\tilde{\boldsymbol{u}}_i^{(t)})\right] \qquad (8)$$

$$\tilde{\boldsymbol{v}}_a^{(t+1)} = \left[\left(\frac{1}{2\lambda_1^2} + \sum_{i=1}^{N} y_{ia}\right)\boldsymbol{I} - \frac{1}{2}\boldsymbol{H}_{IA}(\tilde{\boldsymbol{v}}_a^{(t)})\right]^{-1}\left[\sum_{i=1}^{N} y_{ia}\,\tilde{\boldsymbol{u}}_i^{(t)} - \frac{1}{2}\boldsymbol{G}_{IA}(\tilde{\boldsymbol{v}}_a^{(t)})\right]$$

$$\tilde{\Lambda}_0^{(t+1)} = \left[\left(\frac{N}{2} + \frac{N}{2\lambda_0^2} + 2\sum_{i=1}^{N}\sum_{j=1}^{N} y_{ij} + \sum_{i=1}^{N}\sum_{a=1}^{M} y_{ia}\right)\boldsymbol{I} + \boldsymbol{G}_{I}(\tilde{\Lambda}_0^{(t)}) + \boldsymbol{G}_{IA}(\tilde{\Lambda}_0^{(t)})\right]^{-1} \qquad (9)$$
Algorithm 1: VBEM estimation procedure
Input: network adjacency matrix 𝒀𝑰, multivariate covariates matrix 𝒀𝑰𝑨, number of dimensions 𝐷
Result: model parameters Θ = (α̃0, α̃1), and ũ𝑖, Λ̃0, ṽ𝑎, Λ̃1
1: while 𝑡 < 𝑁iter and KLdis < 0.999999 do
2:   Compute estimates α̃0 and α̃1 using Equations 11 and 12
3:   Compute estimates Λ̃0 and Λ̃1 using Equations 9 and 10
4:   Compute estimates ũ𝑖 and ṽ𝑎 using Equations 7 and 8
5:   Compute the estimate KL using Equations 5 and 6
6:   KLdis ← KL𝑡 / KL𝑡−1
7:   𝑡 ← 𝑡 + 1
8: end while
9: return [α̃0, α̃1, ũ𝑖, Λ̃0, ṽ𝑎, Λ̃1, KL]
" 𝑁 𝑀
! # −1
𝑀 𝑀 1 ∑︁ ∑︁
Λ̃1(𝑡+1) = + (𝑡)
𝑦𝑖𝑎 𝑰 + 𝑮 𝒊𝒂 ( Λ̃1 ) , (10)
2 2 𝜆21 𝑖=1 𝑎=1
M-step: Estimate α̃0(𝑡+1) and α̃1(𝑡+1) with the following update rules,

$$\tilde{\alpha}_0^{(t+1)} = \frac{\sum_{i=1}^{N}\sum_{j=1}^{N} y^{I}_{ij} - g_{I}(\tilde{\alpha}_0^{(t)}) + \tilde{\alpha}_0^{(t)}\,h_{I}(\tilde{\alpha}_0^{(t)})}{h_{I}(\tilde{\alpha}_0^{(t)})} \qquad (11)$$

$$\tilde{\alpha}_1^{(t+1)} = \frac{\sum_{i=1}^{N}\sum_{a=1}^{M} y^{IA}_{ia} - g_{IA}(\tilde{\alpha}_1^{(t)}) + \tilde{\alpha}_1^{(t)}\,h_{IA}(\tilde{\alpha}_1^{(t)})}{h_{IA}(\tilde{\alpha}_1^{(t)})}, \qquad (12)$$
where 𝑔𝐼(α̃0(𝑡)) and 𝑔𝐼𝐴(α̃1(𝑡)) are the partial derivatives (gradients) of 𝑭𝑰 and 𝑭𝑰𝑨 with respect to α̃0 and α̃1, evaluated at α̃0(𝑡) and α̃1(𝑡); and ℎ𝐼(α̃0(𝑡)) and ℎ𝐼𝐴(α̃1(𝑡)) are the second-order partial derivatives of 𝑭𝑰 and 𝑭𝑰𝑨 with respect to α̃0 and α̃1, evaluated at α̃0(𝑡) and α̃1(𝑡).
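Algorithm 1 alternates the closed-form VE updates with the M-step updates until the ELBO ratio stabilizes. The skeleton below shows only the control flow of the loop; the per-step updates are passed in as callables standing in for Equations 7 through 12 and the ELBO in Equations 5 and 6, and none of it is the authors' implementation.

```python
def vbem(Y_I, Y_IA, init, update_alphas, update_lambdas, update_positions, compute_elbo,
         n_iter=500, tol=0.999999):
    """Control flow of the VBEM loop in Algorithm 1 (update_* are placeholders)."""
    params = init(Y_I, Y_IA)
    kl_prev, kl_ratio, t = None, 0.0, 0
    while t < n_iter and kl_ratio < tol:
        params = update_alphas(params, Y_I, Y_IA)      # Equations 11 and 12
        params = update_lambdas(params, Y_I, Y_IA)     # Equations 9 and 10
        params = update_positions(params, Y_I, Y_IA)   # Equations 7 and 8
        kl = compute_elbo(params, Y_I, Y_IA)           # Equations 5 and 6
        kl_ratio = (kl / kl_prev) if kl_prev is not None else 0.0
        kl_prev, t = kl, t + 1
    return params, kl_prev
```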
Figure 1: The distributions of the AAEs when 𝛼0 = 2 and 𝛼1 = 1.5.
4 Simulation Study
In this section, we conduct a simulation study to evaluate the proposed VBEM algo-
rithm’s performance for the APLSM. We compare APLSM with two baseline approaches,
namely LSM using variational inference proposed by Gollini and Murphy (2016), and LSM
for bipartite networks (BLSM) using variational inference developed in Section 2.2. We ex-
pect that with the inclusion of the information in 𝒀𝑰 𝑨 , the APLSM will exhibit stronger
model recovery than the LSM for the social network. Similarly, we expect the APLSM
to be more successful in model recovery than the BLSM due to the addition of information
from 𝒀𝑰 .
To assess whether the proposed algorithm can recover the true link probabilities, we use
the average absolute error (AAE) between the true link probabilities and the estimated link
probabilities as a metric. The smaller the AAEs, the closer the estimated link probabilities
are to the true link probabilities, indicating better model recovery. To assess whether
the proposed algorithm can recover true 𝛼 values, we look at the differences between the
estimated values and the true values. If the differences are concentrated around mean
0 with a small variance, it will indicate a good recovery. Finally, we assess the fit of the
estimated latent positions by calculating the proportions of the pairwise distances based on the estimated latent positions to the pairwise distances based on the true latent positions.

Figure 2: Distributions of the distances between the true and estimated 𝛼0 (left) and 𝛼1 (right), when 𝛼0 = 0.5 and 𝛼1 = 0.
Even though the true latent positions cannot be exactly recovered in any latent space
model due to unidentifiability, we expect a successful algorithm to be able to recover the
true distances Sewell and Chen (2015). If the estimated latent positions preserve the nodes’
relative positions in the latent space, we expect the proportion of the estimated pairwise
distances to the true pairwise distances to be close to 1.
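The evaluation criteria are simple to compute once the true and estimated quantities are available. The helpers below are our own sketch of the three checks (AAE, α recovery, and pairwise distance ratios), under the assumption that positions and probabilities are stored as NumPy arrays.

```python
import numpy as np
from scipy.spatial.distance import pdist

def average_absolute_error(true_prob, est_prob):
    """AAE between true and estimated link probabilities."""
    return np.abs(true_prob - est_prob).mean()

def alpha_error(alpha_true, alpha_est):
    """Difference between the estimated and true intercept."""
    return alpha_est - alpha_true

def pairwise_distance_ratios(true_pos, est_pos):
    """Ratios of estimated to true pairwise distances; values near 1 indicate
    that the relative configuration of the nodes has been recovered."""
    return pdist(est_pos) / pdist(true_pos)
```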
The design of the simulation is as follows. We first generate data following Equation 4.
To do this, we set 𝜆 0 and 𝜆 1 to be 1 and the number of attributes and persons to be 50. We
sample 𝑼 and 𝑽 from the multivariate normal distributions using the above parameter
values. We produce true link probabilities between persons and between attributes and
persons using different sets of 𝛼 values. Different 𝛼 values are associated with different
densities of the data. Then, we generate 𝒀𝑰 and 𝒀𝑰 𝑨 matrices. Each entry of the matrices
is independently generated from the Bernoulli distribution using the corresponding link
probability. We apply the APLSM with the VBEM estimator to the generated data and
obtain the latent positions’ posterior distributions and estimates for the fixed parameters.
We use the posterior means as the point estimates of the latent positions to obtain the
estimated probabilities. We also fit LSM to 𝒀𝑰 and BLSM to 𝒀𝑰 𝑨 . Using the posterior
means as point estimates of the latent positions, we obtain the estimated probabilities for
𝒀𝑰 and 𝒀𝑰 𝑨 , respectively. We repeat this process 200 times and report the results.
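For reference, one replicate of the simulation design just described could be generated as follows; the particular α values are one of the settings reported in the figures, and the generator itself is our own sketch rather than the authors' code.

```python
import numpy as np

rng = np.random.default_rng(1)
N = M = 50                      # number of persons and attributes
D, lam0, lam1 = 2, 1.0, 1.0
alpha0, alpha1 = 2.0, 1.5       # one of the (alpha0, alpha1) settings

U = rng.normal(0, lam0, (N, D))          # latent person positions
V = rng.normal(0, lam1, (M, D))          # latent attribute positions

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

d2_II = ((U[:, None, :] - U[None, :, :]) ** 2).sum(-1)   # person-person squared distances
d2_IA = ((U[:, None, :] - V[None, :, :]) ** 2).sum(-1)   # person-attribute squared distances

P_I, P_IA = sigmoid(alpha0 - d2_II), sigmoid(alpha1 - d2_IA)
Y_I = rng.binomial(1, P_I)
np.fill_diagonal(Y_I, 0)                 # no self-ties in the social network
Y_IA = rng.binomial(1, P_IA)
```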
Figure 3: Distributions of pairwise distance ratios, comparing 𝒖˜𝒊 with 𝒖𝒊 (left) and 𝒗˜𝒂 with 𝒗 𝒂 (right)
when 𝛼0 = −1 and 𝛼1 = 0.5.
In Figure 1, we present the boxplots over 200 simulations of averages of the absolute
differences (AAEs) across entries in 𝒀𝑰 and 𝒀𝑰 𝑨 for APLSM, LSM and BLSM when 𝛼0 = 2
and 𝛼1 = 1.5. We see that the AAEs are generally smaller when an APLSM is fitted to the
data compared to AAEs when an LSM is fitted. Therefore there is a strong improvement in
model recovery using APLSM for 𝒀𝑰 compared with using LSM. Clearly, this improvement
results from the added attribute information that consolidates the latent person position
estimates. Similarly, the AAEs are smaller based on fitting the APLSM than AAEs based
on fitting the BLSM. Again, this suggests a large improvement in model recovery using
APLSM for 𝒀𝑰 𝑨 over using BLSM. In this case, the added social network consolidates the
latent person position estimates resulting in a better model recovery for APLSM.
In Figure 2, we present the distributions of the differences between the estimated 𝛼
values and the true 𝛼 values. The true 𝛼 values are .5 and 0 for 𝛼0 and 𝛼1 . The distribution
of the differences for each 𝛼 is centered around 0, indicating little bias in the 𝛼 estimates.
Both distributions are also relatively narrow, implying that the estimated 𝛼 values are precise and close
to the true 𝛼 values.
Finally, in Figure 3, we compare the pairwise distances based on the estimated latent
positions with those based on the true latent positions. The distance between nodes 𝑖 and
𝑗 using estimated 𝒖˜ 𝒊 and 𝒖˜ 𝒋 should be close to the true distance between nodes 𝑖 and 𝑗
using 𝒖 𝒊 and 𝒖 𝒋 if the VBEM estimation algorithm successfully maintains and recovers the
relationship between nodes 𝑖 and 𝑗. As shown in both plots in Figure 3, the distributions
of the ratios of the estimated and true pairwise distances are narrow and centered around
1, implying satisfactory recoveries of the nodes’ relationships with each other through the
estimated latent positions.
5 Application
We apply APLSM to the French financial elite dataset collected by Kadushin and de Quillacqi to study the friendship network among top financial elites in France during the last years of the Socialist government (Kadushin, 1995). We first introduce the data and
then describe our results in detail.
Data
The data were collected through interviews with people who held leading positions in
major financial institutions and frequently appeared in financial section press reports. The
friendship information was collected by asking the interviewees to name their friends in the
social context. Kadushin and de Quillacqi then identified an inner circle of 28 elites from the
initial sample based on their influence and their perceived eliteness by other participants.
The resulting friendship network is a symmetric adjacency matrix.
The data also contains additional background information, including age, a complex set
of post-secondary education experiences, place of birth, political and cabinet careers, po-
litical party preference, religion, current residence, and club memberships. Two aspects of
the elites’ “prestige” include whether the person is named in the social register and whether the person has a particle (“de”) in front of either his (no woman was in the inner circle), his wife’s, his mother’s, or his children’s names. Having “de” in the name is associated with nobility. The father’s occupation is one of the variables used to reflect an elite’s social class; it is considered “high” if the father is in higher management, a professional, an industrialist, or an investor. Unfortunately, upon communication with the original author, we found that the coding procedures for some variables have been lost, including the Finance Ministry information, religion, etc. We end up with 13 binary variables covering education (“Science Po”, “Polytechnique”, “University” and “Ecole Nationale d’Administration”), career (“Inspection General de Finance” and “Cabinet”), class (“Social Register”, “Father Status”, “Particule”), politics (“Socialist”, “Capitalist” and “Centrist”) and “Age”, after excluding the lost or unrelated information, i.e., mason and location, which are not associated with the social network based on Kadushin (1995) (location is not considered to be related to the social network after adjusting for multiple comparisons). “Age” was converted into a binary variable following
Kadushin (1995), where a group of elites was considered of older age with an average birth
year of 1938. We will use ENA as an abbreviation for Ecole Nationale d’Administration.
The Science Po and the other educational variables warrant further explanation. The Science Po, or the Institut d’Etudes Politiques de Paris, prepares students for the entrance exam of the ENA. An alternative to the Science Po is the (Ecole) Polytechnique, a French military school whose graduates often enter one of the technical ministries. Both the Science Po and the Polytechnique are called Grandes Ecoles. A Grandes Ecoles education is highly respected in France as it leads to membership in the ENA, where the grands corps, the French civil service elites including the Inspection General de Finance, recruit their members (Kadushin, 1995).
Previous Approaches
The authors in Kadushin (1995) first used multidimensional scaling to draw the friend-
ship network’s sociogram. Then they applied Quadratic Assignment Procedure regressions
and correlations to test each background variable’s association with the social network.
Based on the social network, two clusters were identified, which the authors called the left
and the right moieties. The dependence between the social network and background infor-
mation was understood through comparisons of the elites between the left and the right
moiety. The elites in the right moiety were found to have a higher social class (upper-class
parentage with high social standing), to be older (average birth year of 1929), and to have
fewer appointments in public offices. The left moiety elites were more likely to be ENA
graduates, grand corps members, cabinet members, treasury service members, socialists,
and younger (average birth year of 1938).
Using the APLSM, we will construct a joint latent space that allows us to jointly model the elites’ friendship connections and their background information. We will also replicate Kadushin (1995)’s left and right moieties, adding a simultaneous interpretation for the division in the elite circle. Furthermore, we observe an additional division within the left moiety using APLSM, which provides opportunities for new hypotheses.
Figure 4: The estimated latent person and attribute positions, 𝒖˜𝒊 and 𝒗˜𝒂 based on the BLSM (left) and
the APLSM (right). The white ellipses represent 80% approximate credible intervals for the ṽ𝑎 , and the
grey ellipses represent 80% approximate credible intervals for the ũ𝑖 . The latent positions of the attributes
Center, Right, Social register, ENA and Socialist are colored as blue, green, green, purple and red. The
black edges represent the friendship edges.
For the social network, we fit the latent space model (LSM) using the variational inference proposed in Gollini and Murphy (2016). For the multivariate
covariates, we fit the bipartite latent space model (BLSM) using the variational inference
method we developed in Section 2.2.
In Figure 4, we present the estimated latent person and attribute positions, 𝒖˜ 𝒊 and 𝒗˜𝒂
from fitting the BLSM (left) and from fitting the APLSM (right). In Figure 5, we exclusively
compare the resulting latent person positions using only the social network, 𝒀𝑰 (left), only
the multivariate covariates, 𝒀𝑰 𝑨 (middle), and both the social network and the multivariate
covariates, 𝒀𝑰 and 𝒀𝑰 𝑨 (right). An angular rotation of the latent friendship space is applied
to match the latent person positions in the joint latent space. A congruence coefficient of
0.96 was found between the two sets of latent positions. Used as a measure of dimension
similarity, a congruence coefficient of 0.95 and above indicates that the dimensions are
identical. A congruence coefficient of 0.91 was found between the latent person positions
using the BLSM and the latent person positions using the APLSM. As expected, the latent
person positions’ structure using APLSM more closely resembles the structure using LSM
than that using BLSM.
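The comparison of the two configurations relies on an angular rotation followed by a congruence coefficient. The sketch below uses an orthogonal Procrustes rotation for the alignment and computes Tucker's congruence coefficient on the flattened configurations; both choices are ours and only approximate the procedure described in the text.

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

def congruence(X, Y):
    """Tucker's congruence coefficient between two configurations (flattened)."""
    x, y = X.ravel(), Y.ravel()
    return (x @ y) / np.sqrt((x @ x) * (y @ y))

def aligned_congruence(pos_ref, pos_other):
    """Rotate pos_other onto pos_ref, then compute the congruence coefficient."""
    R, _ = orthogonal_procrustes(pos_other, pos_ref)
    return congruence(pos_ref, pos_other @ R)
```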
Figure 5: The estimated latent person positions 𝒖˜𝒊 in the friendship latent space using only the social
network 𝒀𝑰 (left), using only the multivariate covariates 𝒀𝑰 𝑨 (middle) and in the joint latent space using
both 𝒀𝑰 and 𝒀𝑰 𝑨 (right). The edges are the observed friendship connections. The grey ellipses represent
80% approximate credible intervals for the ũ𝑖 . The numbers represent the randomly assigned indices for
the French elites.
Assess Model Fit
To assess the model fit of APLSM to the data, we obtain the area under the receiver
operating characteristic curve (AUC) of predicting the presence or absence of a link from
the estimated link probabilities. The receiver operating characteristic curves (ROCs) and
the AUC values for 𝒀𝑰 and 𝒀𝑰 𝑨 are presented in Figure 6. The results show satisfactory
fit for both matrices. In addition, we assess whether the APLSM captures the structure of the social network and, more generally, the dependencies among the attributes and between the persons and the attributes.
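The AUC check compares every observed binary entry with its estimated link probability; a sketch using scikit-learn's roc_auc_score (our choice of routine, not necessarily the one used by the authors):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def network_auc(Y, P, exclude_diagonal=False):
    """AUC of predicting the observed ties in Y from the estimated probabilities P."""
    if exclude_diagonal:                         # for the square social network Y_I
        mask = ~np.eye(Y.shape[0], dtype=bool)
    else:                                        # for the rectangular matrix Y_IA
        mask = np.ones(Y.shape, dtype=bool)
    return roc_auc_score(Y[mask], P[mask])
```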
In Figure 7, we assess whether the APLSM can capture the friendship structure in the
social network. In the left panel, we compare the elites’ distances to the center with the total numbers of their friendship connections. We plot the rankings of the distances to the center against the rankings of the total friendship connections. In this way, we assess whether the APLSM preserves the elites’ social hierarchy without being distracted by the distributional differences between Euclidean distances and friendship counts. A solid reference line is drawn to illustrate the relationship between the two. The intercept of the solid line equals 𝑁, the highest ranking, and the slope equals −1. As the solid line roughly goes through the center of the scatter plot, we know that the rankings of the distances to the center mirror the rankings of the friendship counts in the opposite direction. The Spearman’s rank correlation between the Euclidean distances and friendship counts is −0.53.
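The ranking comparison in the left panel reduces to two vectors, one of latent distances and one of degrees, and their Spearman correlation. A sketch, with the centroid of the estimated positions used as the center (an assumption on our part):

```python
import numpy as np
from scipy.stats import spearmanr

def hierarchy_check(U_hat, Y_I):
    """Spearman correlation between distance-to-center and friendship counts."""
    center = U_hat.mean(axis=0)                         # centroid as the center (assumption)
    dist_to_center = np.linalg.norm(U_hat - center, axis=1)
    degree = Y_I.sum(axis=1)                            # total friendship connections
    rho, _ = spearmanr(dist_to_center, degree)
    return rho
```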
In the right panel of Figure 7, we present the box plot of the elites’ pairwise distances in the
absence of friendship versus when they are friends. We can see that the pairwise distances
are generally smaller between elites who are friends with a median of 0.75, compared to
the pairwise distances between elites who are not friends with a median of 1.60. This
observation shows that pairwise distances between elites using the APLSM distinguish
between whether the elites are friends or not.
In Figure 8, we assess the APLSM’s ability to capture the dependencies among at-
tributes. In the left panel, we plot the rankings of attributes’ distances to center against
the rankings of the attribute sum scores. As can be seen, the two types of rankings roughly
follow the solid line with an intercept of 𝑀, the highest ranking, and a slope of −1, sug-
gesting a mirroring of the two rankings in the opposite direction. The Spearman’s rank
correlation between the attribute’s distances to center and attribute sum scores is −0.92.
This observation suggests that attributes’ distances to the center capture the overall re-
sponse rates of the attributes. In the right panel of Figure 8, we plot the rankings of the
pairwise distances against the rankings of the attribute correlation values. As can be seen,
the two types of rankings center around the dashed line with an intercept of 80, the highest ranking, and a slope of −1, suggesting a mirroring of the two rankings in the opposite direction; the Spearman’s rank correlation between the two is −0.62. This result shows that the pairwise distances between attributes using the APLSM capture the correlations between the attributes.

Figure 7: Assess the fit of the person latent positions. The left panel shows the rankings of nodes’ distances to center against the rankings of the nodes’ total friendship connections. The solid line is a reference line with an intercept of 𝑁 and a slope of −1. The right panel shows the pairwise distances between pairs of nodes when they are friends versus when they are not friends.
Using Figure 9 (left), we assess whether the APLSM is able to differentiate the presence
of an attribute from the absence. We compare the pairwise distances between attributes
and persons when the attribute is present to the pairwise distances between persons and
attributes when the attribute is absent. We note that when the attributes are present,
the distances are generally smaller with a median of 1.13, than those when the attributes
are absent with a median of 1.71. Therefore the pairwise distances between persons and
attributes using the APLSM distinguish the presence of the attributes from their absence.
To replicate the left and right moiety in Kadushin (1995), we apply K-means clustering
(Hartigan et al., 1979) to the latent person and attribute positions with 𝑘 = 2. The
algorithm partitions the latent positions into 𝑘 groups which minimize the sum of squares
within each group. The k-means clustering is performed with the K-means function in the
kknn package, which implements the Hartigan–Wong algorithm (Hechenbichler and Schliep, 2004).

Figure 8: Assess the fit of the attribute latent positions. The left panel shows the rankings of attributes’ distances to center against the rankings of the attributes’ sum scores. The right panel shows the rankings of attributes’ pairwise distances against the rankings of the attributes’ correlations.

We run the K-means algorithm with 100 random starting positions and take the solution which optimizes the objective function. The resulting two clusters are shown in Figure 10, explaining 50.4% of the total variance in the data (defined as the proportion of the within-cluster sum of squares to the total sum of squares). The two clusters contain both attributes and persons.
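The clustering itself is ordinary k-means on the stacked person and attribute positions with many random restarts. The sketch below uses scikit-learn's KMeans in place of the R routine reported by the authors, and computes the variance explained as the between-cluster share of the total sum of squares (one common convention).

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_joint_space(U_hat, V_hat, k=2, n_init=100, seed=0):
    """K-means on the joint set of latent person and attribute positions."""
    X = np.vstack([U_hat, V_hat])
    km = KMeans(n_clusters=k, n_init=n_init, random_state=seed).fit(X)
    total_ss = ((X - X.mean(axis=0)) ** 2).sum()
    explained = 1.0 - km.inertia_ / total_ss   # between-cluster SS / total SS
    return km.labels_, explained
```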
In Figure 10, we show the latent person and attribute positions colored by the resulting
two clusters. The solid lines represent the observed friendship connections between pairs
of elites. We can see that a more densely connected network can be found on the left side
than the right. Figure 9 (right) shows the probability of a pair of elites being friends in each
cluster and between the two clusters. The median probability of friendship connections in
the left cluster is roughly 0.29; the median probability in the right cluster is roughly 0.22;
while between the two clusters, the median probability is roughly 0.05. This result shows
that there is slightly stronger connectivity among elites in the left cluster than in the right
cluster. Also, there is a separation of French elites based on social connectivity into the
two clusters.
In Figure 10, we also see that the attributes Socialist, Inspection General de Finance, ENA,
Cabinet, University, Polytechnique and Center (Centrists) are found in the left (red) cluster,
suggesting that the elites in this cluster are more likely to obtain higher education, hold
top office positions and identify as socialists or centrists. On the other hand, attributes such as Right (Capitalist), Father Status, Social Register, Particule, Science Po, and Age are found in the right (green) cluster, which suggests that the elites in the right cluster are more likely to be of higher class (being present in the social register, having a particule in their names, and having fathers of higher status), of older age, and to identify as politically right.

Figure 9: (A) The pairwise distances between attributes and persons when the attributes are present versus when the attributes are absent. (B) The box plot showing the probability of a pair of elites being friends in the left and the right clusters and between the two clusters.
The two clusters based on APLSM largely correspond to the left and right moieties found in Kadushin (1995). However, in contrast to Kadushin (1995), the two clusters here are joint clusters containing both attributes and persons. The addition of the attributes gives more meaning to the division of the French elites: we see the division of the social circles based on party affiliations, class, education, and career. This finding was also observed in the previous study, though Kadushin (1995) compared the background information of the elites between the left and right moieties only after the analysis of the network (only 12.1% of the variance in four attribute variables was explained through regression). Because we are able to use relevant attribute information in the clustering process, there is more confidence in our observation of the division. In addition, the approximate credible intervals of the person and attribute nodes using the APLSM are much smaller than those using the BLSM and the LSM. The drivers behind the division automatically and systematically emerge from the data.
Figure 10: The division of the French financial elites into two clusters. The left panel displays the two
clusters of elites with positions of key attributes. The right panel displays latent attribute positions. Both
the latent person and latent attribute positions are part of the joint latent space shown in Figure 4. The
ellipses represent 80% approximate credible intervals for the latent positions. The solid lines represent the
observed friendship connections between pairs of elites.
The APLSM captures how the social relations between the French elites are related to their career, politics, and class information. In the previous section, we replicated the division of the French financial elites into the left and right moieties. Though not mentioned in the previous study (Kadushin, 1995), we believe that the presence of a third cluster might be justified given the visible separation between the ENA attribute and the Polytechnique attribute in the left moiety and our prior knowledge of the French education system. Therefore, we run the k-means clustering algorithm with 𝑘 = 3. The algorithm is again implemented with the kmeans function in the kknn R package (Hechenbichler and Schliep, 2004) with 100 random starting positions. Approximately 63.9% of the total variance is explained by the three clusters, with 13.5% of the variance explained by the added cluster. As before, the resulting three clusters contain both attributes and persons.
Figure 11 displays the estimated latent attribute and person positions (same as before)
colored by the resulting three clusters. The additional (blue) cluster is found with the
Polytechnique and Center attributes, named the Polytechnique cluster. The red cluster,
part of the left moiety, is centered around ENA attributes, called the ENA cluster. The
Science Po attribute is now part of the ENA (red) cluster. The green cluster, part of the
right moiety, is again centered around high-class attributes, called the HighClass cluster.
Figure 11: The division of the French financial elites into three clusters. The left panel displays the three
clusters of elites with positions of key attributes. The right panel displays latent attribute positions. Both
the latent person and latent attribute positions are part of the joint latent space. The ellipses represent
80% approximate credible intervals for the latent positions. The solid lines represent the observed friendship
connections between pairs of elites.
Figure 12: Boxplots showing the probabilities of feature attributes being indicated as positive in each of
the three clusters.
Figure 13: The boxplot showing the posterior probability of an edge in the social network between elites in the same HighClass, ENA and Polytechnique clusters and between each pair of clusters.
Figure 13 shows the posterior probability of an edge between an elite and someone from the same cluster and someone from each of the other two clusters. The median probabilities of friendship connections within the HighClass, ENA, and Polytechnique clusters are 0.36, 0.32, and 0.36, respectively. The median probabilities between clusters are as follows: 0.06 between the HighClass cluster and the Polytechnique cluster, 0.02 between the HighClass cluster and the ENA cluster, and 0.11 between the ENA cluster and the Polytechnique cluster. Clearly, there is more intra-cluster connectivity than between-cluster connectivity for all clusters. We observe the strongest between-cluster connectivity between the elites in the ENA cluster and the Polytechnique cluster and the weakest between elites in the ENA cluster and the HighClass cluster, which makes sense because one either holds one’s position because of one’s education (ENA) or because of the connections of one’s family.
The left panel of Figure 14 displays the latent person positions of socialist (red), centrist (blue), and capitalist (green) elites along with the latent positions of the attributes Socialist, Center, and Right. Elites with no party affiliations are shown as gray
circles in the joint latent space. As can be seen, elites with different party affiliations are
positioned near the corresponding attributes and are far apart in the joint latent space.
The right panel of Figure 14 displays the latent positions of ENA graduates (purple) and
non-ENA graduates (white) along with the attribute ENA. Again, we see a clear separation
Figure 14: The latent person positions colored by key attributes: party (left) and ENA (right). The left
panel displays the latent positions of socialist (red), centrist (blue) and capitalist (green) elites along with
the attributes Socialist, Center and Right. Four French elites indicate no party affiliations and are shown
as gray circles in the joint latent space. The right panel displays the latent positions of ENA graduates
(purple) along with the attribute ENA. Non-ENA graduates are shown as white circles in the joint latent
space. The ellipses represent 80% approximate credible intervals for the latent positions.
between ENA graduates and non-ENA graduates as the ENA graduates center around the
ENA attribute.
Supplementary Materials
The supplementary materials contain detailed derivations of the variational Bayesian
EM algorithm for the proposed model, the APLSM, parts of which are used to estimate
the BLSM.
References
Agarwal, A., and Xue, L. (2020), “Model-based clustering of nonparametric weighted networks with ap-
plication to water pollution analysis,” Technometrics, 62, 161–172.
Airoldi, E. M., Blei, D. M., Fienberg, S. E., and Xing, E. P. (2008), “Mixed membership stochastic
blockmodels,” Journal of Machine Learning Research, 9, 1981–2014.
Albert, R., and Barabási, A.-L. (2002), “Statistical mechanics of complex networks,” Reviews of modern
physics, 74, 47.
Arroyo, J., Athreya, A., Cape, J., Chen, G., Priebe, C. E., and Vogelstein, J. T. (2019), “Inference for
multiple heterogeneous networks with a common invariant subspace,” arXiv preprint arXiv:1906.10026.
Attias, H. (1999), “Inferring parameters and structure of latent variable models by variational Bayes,”
in Proceedings of the Fifteenth conference on Uncertainty in artificial intelligence, Morgan Kaufmann
Publishers Inc., pp. 21–30.
Austin, A., Linkletter, C., and Wu, Z. (2013), “Covariate-defined latent space random effects model,”
Social networks, 35, 338–346.
Barabási, A.-L., and Albert, R. (1999), “Emergence of scaling in random networks,” science, 286, 509–512.
Barbillon, P., Donnet, S., Lazega, E., and Bar-Hen, A. (2015), “Stochastic block models for multiplex
networks: an application to networks of researchers,” arXiv preprint arXiv:1501.06444.
Baum, L. E., Petrie, T., Soules, G., and Weiss, N. (1970), “A maximization technique occurring in the
statistical analysis of probabilistic functions of Markov chains,” The annals of mathematical statistics,
41, 164–171.
Beal, M. J., Ghahramani, Z. et al. (2006), “Variational Bayesian learning of directed graphical models with
hidden variables,” Bayesian Analysis, 1, 793–831.
Beal, M. J. et al. (2003), Variational algorithms for approximate Bayesian inference, University of London, London.
Bickel, P. J., and Chen, A. (2009), “A nonparametric view of network models and Newman–Girvan and
other modularities,” Proceedings of the National Academy of Sciences, 106, 21068–21073.
Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. (2017), “Variational inference: A review for statisticians,”
Journal of the American Statistical Association, 112, 859–877.
Boccaletti, S., Bianconi, G., Criado, R., Del Genio, C. I., Gómez-Gardeñes, J., Romance, M., Sendina-
Nadal, I., Wang, Z., and Zanin, M. (2014), “The structure and dynamics of multilayer networks,” Physics
Reports, 544, 1–122.
Borgatti, S. P., Mehra, A., Brass, D. J., and Labianca, G. (2009), “Network analysis in the social sciences,”
science, 323, 892–895.
Bramoullé, Y., Djebbari, H., and Fortin, B. (2009), “Identification of peer effects through social networks,”
Journal of econometrics, 150, 41–55.
Bullmore, E., and Sporns, O. (2009), “Complex brain networks: graph theoretical analysis of structural
and functional systems,” Nature Reviews Neuroscience, 10, 186–198.
Carrington, P. J., Scott, J., and Wasserman, S. (2005), Models and methods in social network analysis,
vol. 28, Cambridge university press.
Celisse, A., Daudin, J. J., and Pierre, L. (2012), “Consistency of maximum-likelihood and variational
estimators in the stochastic block model,” Electronic Journal of Statistics, 6, 1847–1899.
D’Angelo, S., Alfò, M., and Murphy, T. B. (2018), “Node-specific effects in latent space modelling of
multidimensional networks,” in 49th Scientific meeting of the Italian Statistical Society.
Daudin, J. J., Picard, F., and Robin, S. (2008), “A mixture model for random graphs,” Stat Comput, 18,
173–183.
Dean, D. O., Bauer, D. J., and Prinstein, M. J. (2017), “Friendship Dissolution Within Social Networks
Modeled Through Multilevel Event History Analysis,” Multivariate behavioral research, 52, 271–289.
Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977), “Maximum likelihood from incomplete data via
the EM algorithm,” Journal of the Royal Statistical Society. Series B (Methodological), 1–38.
Dorans, N., and Drasgow, F. (1978), “Alternative weighting schemes for linear prediction,” Organizational
Behavior and Human Performance, 21, 316–345.
Ferrara, E., Interdonato, R., and Tagarelli, A. (2014), “Online popularity and topical interests through the
lens of instagram,” in Proceedings of the 25th ACM conference on Hypertext and social media, ACM,
pp. 24–34.
Fosdick, B. K., and Hoff, P. D. (2015), “Testing and modeling dependencies between a network and nodal
attributes,” Journal of the American Statistical Association, 110, 1047–1056.
Frank, K. A., Zhao, Y., and Borman, K. (2004), “Social capital and the diffusion of innovations within
organizations: The case of computer technology in schools,” Sociology of education, 77, 148–171.
Fratiglioni, L., Wang, H.-X., Ericsson, K., Maytan, M., and Winblad, B. (2000), “Influence of social network
on occurrence of dementia: a community-based longitudinal study,” The lancet, 355, 1315–1319.
Friel, N., Rastelli, R., Wyse, J., and Raftery, A. E. (2016), “Interlocking directorates in Irish companies
using a latent space model for bipartite networks,” Proceedings of the National Academy of Sciences,
113, 6629–6634.
Fujimoto, K., Wang, P., and Valente, T. W. (2013), “The decomposed affiliation exposure model: A
network approach to segregating peer influences from crowds and organized sports,” Network Science,
1, 154–169.
Girvan, M., and Newman, M. E. (2002), “Community structure in social and biological networks,” Pro-
ceedings of the national academy of sciences, 99, 7821–7826.
Goldsmith-Pinkham, P., and Imbens, G. W. (2013), “Social networks and the identification of peer effects,”
Journal of Business & Economic Statistics, 31, 253–264.
Gollini, I., and Murphy, T. B. (2016), “Joint modeling of multiple network views,” Journal of Computational
and Graphical Statistics, 25, 246–265.
Guhaniyogi, R., Rodriguez, A. et al. (2020), “Joint modeling of longitudinal relational data and exogenous
variables,” Bayesian Analysis, 15, 477–503.
Handcock, M. S., Raftery, A. E., and Tantrum, J. M. (2007), “Model-based clustering for social networks,”
J. Roy. Statist. Soc. Ser. A, 170, 301–354.
Hartigan, J. A., and Wong, M. A. (1979), “A k-means clustering algorithm,” Journal of the Royal Statistical Society: Series C (Applied Statistics), 28, 100–108.
He, X., Kan, M.-Y., Xie, P., and Chen, X. (2014), “Comment-based multi-view clustering of web 2.0
items,” in Proceedings of the 23rd international conference on World wide web, ACM, pp. 771–782.
Hechenbichler, K., and Schliep, K. (2004), “Weighted k-nearest-neighbor techniques and ordinal classification.”
Henderson, H. V., and Searle, S. R. (1981), “On deriving the inverse of a sum of matrices,” Siam Review,
23, 53–60.
Hoff, P. (2008), “Modeling homophily and stochastic equivalence in symmetric relational data,” in Advances
in neural information processing systems, pp. 657–664.
Hoff, P. D. (2005), “Bilinear mixed-effects models for dyadic data,” Journal of the american Statistical
association, 100, 286–295.
— (2009), “Multiplicative latent factor models for description and prediction of social networks,” Compu-
tational and mathematical organization theory, 15, 261.
— (2018), “Additive and multiplicative effects network models,” arXiv preprint arXiv:1807.08038.
Hoff, P. D., Raftery, A. E., and Handcock, M. S. (2002), “Latent space approaches to social network
analysis,” Journal of the american Statistical association, 97, 1090–1098.
Huang, W., Liu, Y., Chen, Y. et al. (2020), “Mixed membership stochastic blockmodels for heterogeneous
networks,” Bayesian Analysis, 15, 711–736.
Jackson, M. O. et al. (2008), Social and economic networks, vol. 3, Princeton University Press Princeton.
Jensen, J. L. W. V. (1906), “Sur les fonctions convexes et les inégalités entre les valeurs moyennes,” Acta
mathematica, 30, 175–193.
Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. (1999), “An introduction to variational
methods for graphical models,” Machine learning, 37, 183–233.
Kadushin, C. (1995), “Friendship among the French financial elite,” American Sociological Review, 202–
221.
Kéfi, S., Miele, V., Wieters, E. A., Navarrete, S. A., and Berlow, E. L. (2016), “How structured is the
entangled bank? The surprisingly simple organization of multiplex ecological networks leads to increased
persistence and resilience,” PLoS biology, 14, e1002527.
Kim, Y., and Srivastava, J. (2007), “Impact of social influence in e-commerce decision making,” in Pro-
ceedings of the ninth international conference on Electronic commerce, ACM, pp. 293–302.
Kivelä, M., Arenas, A., Barthelemy, M., Gleeson, J. P., Moreno, Y., and Porter, M. A. (2014), “Multilayer
networks,” Journal of Complex Networks, 2, 203–271.
Krivitsky, P. N., and Handcock, M. S. (2008), “Fitting position latent cluster models for social networks
with latentnet,” Journal of Statistical Software, 24.
Krivitsky, P. N., Handcock, M. S., Raftery, A. E., and Hoff, P. D. (2009), “Representing degree distribu-
tions, clustering, and homophily in social networks with latent cluster random effects models,” Social
networks, 31, 204–213.
Kwon, K. H., Stefanone, M. A., and Barnett, G. A. (2014), “Social network influence on online behavioral
choices: exploring group formation on social network sites,” American Behavioral Scientist, 58, 1345–
1360.
Lazer, D. (2011), “Networks in political science: Back to the future,” PS: Political Science & Politics, 44,
61–68.
Leenders, R. T. A. (2002), “Modeling social influence through network autocorrelation: constructing the
weight matrix,” Social networks, 24, 21–47.
Liu, X., Liu, W., Murata, T., and Wakita, K. (2014), “A framework for community detection in heteroge-
neous multi-relational networks,” Advances in Complex Systems, 17, 1450018.
Lusher, D., Koskinen, J., and Robins, G. (2013), Exponential random graph models for social networks:
Theory, methods, and applications, Cambridge University Press.
Ma, Z., Ma, Z., and Yuan, H. (2020), “Universal Latent Space Model Fitting for Large Networks with
Edge Covariates.” Journal of Machine Learning Research, 21, 1–67.
Matias, C., and Miele, V. (2016), “Statistical clustering of temporal networks through a dynamic stochastic
block model,” Journal of the Royal Statistical Society: Series B (Statistical Methodology).
McPherson, M., Smith-Lovin, L., and Cook, J. M. (2001), “Birds of a feather: Homophily in social net-
works,” Annual review of sociology, 27, 415–444.
Mele, A., Hao, L., Cape, J., and Priebe, C. E. (2019), “Spectral inference for large Stochastic Blockmodels
with nodal covariates,” arXiv preprint arXiv:1908.06438.
Mercken, L., Snijders, T. A., Steglich, C., Vartiainen, E., and De Vries, H. (2010), “Dynamics of adolescent
friendship networks and smoking behavior,” Social networks, 32, 72–81.
Mucha, P. J., Richardson, T., Macon, K., Porter, M. A., and Onnela, J. P. (2010), “Community structure
in time-dependent, multiscale, and multiplex networks,” Science, 328, 876–878.
Nickel, M., Murphy, K., Tresp, V., and Gabrilovich, E. (2016), “A review of relational machine learning
for knowledge graphs,” Proceedings of the IEEE, 104, 11–33.
Paul, S., and Chen, Y. (2016), “Consistent community detection in multi-relational data through restricted
multi-layer stochastic blockmodel,” Electronic Journal of Statistics, 10, 3807–3870.
Paul, S., Chen, Y. et al. (2020a), “A random effects stochastic block model for joint community detection
in multiple networks with applications to neuroimaging,” Annals of Applied Statistics, 14, 993–1029.
— (2020b), “Spectral and matrix factorization methods for consistent community detection in multi-layer
networks,” The Annals of Statistics, 48, 230–250.
Robins, G., Pattison, P., and Elliott, P. (2001), “Network models for social influence processes,” Psychome-
trika, 66, 161–189.
Rubinov, M., and Sporns, O. (2010), “Complex network measures of brain connectivity: uses and inter-
pretations,” Neuroimage, 52, 1059–1069.
Salter-Townshend, M., and McCormick, T. H. (2017), “Latent space models for multiview network data,”
The Annals of Applied Statistics, 11, 1217.
Salter-Townshend, M., and Murphy, T. B. (2013), “Variational Bayesian inference for the latent position
cluster model for network data,” Computational Statistics & Data Analysis, 57, 661–671.
Sengupta, S., and Chen, Y. (2015), “Spectral clustering in heterogeneous networks,” Statistica Sinica,
1081–1106.
Sewell, D. K., and Chen, Y. (2015), “Latent space models for dynamic networks,” Journal of the American Statistical
Association, 110, 1646–1657.
Sewell, D. K., and Chen, Y. (2016), “Latent space models for dynamic networks with weighted edges,” Social Networks,
44, 105–116.
Shalizi, C. R., and Thomas, A. C. (2011), “Homophily and contagion are generically confounded in obser-
vational social network studies,” Sociological methods & research, 40, 211–239.
Shmulevich, I., Dougherty, E. R., Kim, S., and Zhang, W. (2002), “Probabilistic Boolean networks: a
rule-based uncertainty model for gene regulatory networks,” Bioinformatics, 18, 261–274.
Sun, Y., Yu, Y., and Han, J. (2009), “Ranking-based clustering of heterogeneous information networks with
star network schema,” in Proceedings of the 15th ACM SIGKDD international conference on Knowledge
discovery and data mining, ACM, pp. 797–806.
Sweet, T., and Adhikari, S. (2020), “A Latent Space Network Model for Social Influence,” Psychometrika,
1–24.
Sweet, T. M. (2015), “Incorporating covariates into stochastic blockmodels,” Journal of Educational and
Behavioral Statistics, 40, 635–664.
VanderWeele, T. J. (2011), “Sensitivity analysis for contagion effects in social networks,” Sociological
Methods & Research, 40, 240–255.
VanderWeele, T. J., and An, W. (2013), “Social networks and causal inference,” in Handbook of causal
analysis for social research, Springer, pp. 353–374.
Wasserman, S., Faust, K. et al. (1994), Social network analysis: Methods and applications, vol. 8, Cambridge
University Press.
Watts, D. J., and Strogatz, S. H. (1998), “Collective dynamics of ‘small-world’ networks,” Nature, 393, 440.
Xu, K. S., Kliger, M., and Hero III, A. O. (2014), “Adaptive evolutionary clustering,” Data Mining and
Knowledge Discovery, 28, 304–336.
Zhang, X., Xue, S., and Zhu, J. (2020), “A Flexible Latent Space Model for Multilayer Networks,” in
International Conference on Machine Learning, PMLR, pp. 11288–11297.
Supplementary Material for "Joint Latent Space Model for Social Networks with Multivariate Attributes"
The Kullback-Leibler divergence between the variational posterior and the true posterior is:
$$
\begin{aligned}
\sum_{i=1}^{N}\mathrm{KL}\left[q(\boldsymbol{u}_i)\,\|\,f(\boldsymbol{u}_i)\right]
&=-\sum_{i=1}^{N}\int q(\boldsymbol{u}_i)\left(-\frac{1}{2}D\log(\lambda_0^2)+\frac{1}{2}\log(\det(\tilde{\Lambda}_0))-\frac{1}{2\lambda_0^2}\boldsymbol{u}_i^{T}\boldsymbol{u}_i+\frac{1}{2}(\boldsymbol{u}_i-\tilde{\boldsymbol{u}}_i)^{T}\tilde{\Lambda}_0^{-1}(\boldsymbol{u}_i-\tilde{\boldsymbol{u}}_i)\right)d\boldsymbol{u}_i\\
&=\frac{1}{2}\Big(DN\log(\lambda_0^2)-N\log(\det(\tilde{\Lambda}_0))\Big)+\frac{1}{2}\sum_{i=1}^{N}\left(\frac{1}{\lambda_0^2}E_{q(\boldsymbol{u}_i)}[\boldsymbol{u}_i^{T}\boldsymbol{u}_i]-E_{q(\boldsymbol{u}_i)}\big[(\boldsymbol{u}_i-\tilde{\boldsymbol{u}}_i)^{T}\tilde{\Lambda}_0^{-1}(\boldsymbol{u}_i-\tilde{\boldsymbol{u}}_i)\big]\right)\\
&=\frac{1}{2}\Big(DN\log(\lambda_0^2)-N\log(\det(\tilde{\Lambda}_0))\Big)+\sum_{i=1}^{N}\frac{1}{2\lambda_0^2}\Big(\mathrm{Var}(\boldsymbol{u}_i)+E_{q(\boldsymbol{u}_i)}[\boldsymbol{u}_i]^{2}\Big)-\frac{1}{2}ND\\
&=\frac{1}{2}\Big(DN\log(\lambda_0^2)-N\log(\det(\tilde{\Lambda}_0))\Big)+\frac{N\,\mathrm{tr}(\tilde{\Lambda}_0)}{2\lambda_0^2}+\frac{\sum_{i=1}^{N}\tilde{\boldsymbol{u}}_i^{T}\tilde{\boldsymbol{u}}_i}{2\lambda_0^2}-\frac{1}{2}ND
\end{aligned}
$$
$$
\sum_{a=1}^{M}\mathrm{KL}\left[q(\boldsymbol{v}_a)\,\|\,f(\boldsymbol{v}_a)\right]
=\frac{1}{2}\Big(DM\log(\lambda_1^2)-M\log(\det(\tilde{\Lambda}_1))\Big)+\frac{M\,\mathrm{tr}(\tilde{\Lambda}_1)}{2\lambda_1^2}+\frac{\sum_{a=1}^{M}\tilde{\boldsymbol{v}}_a^{T}\tilde{\boldsymbol{v}}_a}{2\lambda_1^2}-\frac{1}{2}MD
$$
$E_{q(\boldsymbol{U},\boldsymbol{V}\mid\boldsymbol{Y}_I,\boldsymbol{Y}_{IA})}[\log f(\boldsymbol{Y}_I,\boldsymbol{Y}_{IA}\mid\boldsymbol{U},\boldsymbol{V})]$ can be expanded into six components. The first two components are calculated as follows:
$$
\begin{aligned}
&\sum_{i=1}^{N}\sum_{j=1,j\neq i}^{N}y_{ij}\,E_{q(\boldsymbol{U},\boldsymbol{V}\mid\boldsymbol{Y}_I,\boldsymbol{Y}_{IA})}\big[\alpha_0-(\boldsymbol{u}_i-\boldsymbol{u}_j)(\boldsymbol{u}_i-\boldsymbol{u}_j)^{T}\big]\\
&=\sum_{i=1}^{N}\sum_{j=1,j\neq i}^{N}y_{ij}\left[\alpha_0-\int(\boldsymbol{u}_i-\boldsymbol{u}_j)(\boldsymbol{u}_i-\boldsymbol{u}_j)^{T}q(\boldsymbol{u}_i)q(\boldsymbol{u}_j)\,d(\boldsymbol{u}_i,\boldsymbol{u}_j)\right]\\
&=\sum_{i=1}^{N}\sum_{j=1,j\neq i}^{N}y_{ij}\left[\tilde{\alpha}_0-\int\sum_{d=1}^{D}(u_{id}-u_{jd})^{2}\,q(\boldsymbol{u}_i)q(\boldsymbol{u}_j)\,d(\boldsymbol{u}_i,\boldsymbol{u}_j)\right]\\
&=\sum_{i=1}^{N}\sum_{j=1,j\neq i}^{N}y_{ij}\left[\tilde{\alpha}_0-\sum_{d=1}^{D}\left(\int u_{id}^{2}\,q(u_{id})\,du_{id}+\int u_{jd}^{2}\,q(u_{jd})\,du_{jd}-\int\!\!\int 2u_{id}u_{jd}\,q(u_{id})q(u_{jd})\,du_{id}\,du_{jd}\right)\right]\\
&=\sum_{i=1}^{N}\sum_{j=1,j\neq i}^{N}y_{ij}\left[\tilde{\alpha}_0-\sum_{d=1}^{D}\Big(E[u_{id}^{2}]+E[u_{jd}^{2}]-2E[u_{id}]E[u_{jd}]\Big)\right]\\
&=\sum_{i=1}^{N}\sum_{j=1,j\neq i}^{N}y_{ij}\left[\tilde{\alpha}_0-\sum_{d=1}^{D}\Big(\mathrm{Var}[u_{id}]+E[u_{id}]^{2}+\mathrm{Var}[u_{jd}]+E[u_{jd}]^{2}-2E[u_{id}]E[u_{jd}]\Big)\right]\\
&=\sum_{i=1}^{N}\sum_{j=1,j\neq i}^{N}y_{ij}\left[\tilde{\alpha}_0-2\,\mathrm{tr}(\tilde{\Lambda}_0)-(\tilde{\boldsymbol{u}}_i-\tilde{\boldsymbol{u}}_j)^{T}(\tilde{\boldsymbol{u}}_i-\tilde{\boldsymbol{u}}_j)\right]
\end{aligned}
$$
$$
\begin{aligned}
&\sum_{i=1}^{N}\sum_{a=1}^{M}y_{ia}\,E_{q(\boldsymbol{U},\boldsymbol{V}\mid\boldsymbol{Y}_I,\boldsymbol{Y}_{IA})}\big[\alpha_1-(\boldsymbol{u}_i-\boldsymbol{v}_a)(\boldsymbol{u}_i-\boldsymbol{v}_a)^{T}\big]\\
&=\sum_{i=1}^{N}\sum_{a=1}^{M}y_{ia}\left[\alpha_1-\int(\boldsymbol{u}_i-\boldsymbol{v}_a)(\boldsymbol{u}_i-\boldsymbol{v}_a)^{T}q(\boldsymbol{u}_i)q(\boldsymbol{v}_a)\,d(\boldsymbol{u}_i,\boldsymbol{v}_a)\right]\\
&=\sum_{i=1}^{N}\sum_{a=1}^{M}y_{ia}\left[\tilde{\alpha}_1-\int\sum_{d=1}^{D}(u_{id}-v_{ad})^{2}\,q(\boldsymbol{u}_i)q(\boldsymbol{v}_a)\,d(\boldsymbol{u}_i,\boldsymbol{v}_a)\right]\\
&=\sum_{i=1}^{N}\sum_{a=1}^{M}y_{ia}\left[\tilde{\alpha}_1-\sum_{d=1}^{D}\left(\int u_{id}^{2}\,q(u_{id})\,du_{id}+\int v_{ad}^{2}\,q(v_{ad})\,dv_{ad}-\int\!\!\int 2u_{id}v_{ad}\,q(u_{id})q(v_{ad})\,du_{id}\,dv_{ad}\right)\right]\\
&=\sum_{i=1}^{N}\sum_{a=1}^{M}y_{ia}\left[\tilde{\alpha}_1-\sum_{d=1}^{D}\Big(E[u_{id}^{2}]+E[v_{ad}^{2}]-2E[u_{id}]E[v_{ad}]\Big)\right]\\
&=\sum_{i=1}^{N}\sum_{a=1}^{M}y_{ia}\left[\tilde{\alpha}_1-\sum_{d=1}^{D}\Big(\mathrm{Var}[u_{id}]+E[u_{id}]^{2}+\mathrm{Var}[v_{ad}]+E[v_{ad}]^{2}-2E[u_{id}]E[v_{ad}]\Big)\right]\\
&=\sum_{i=1}^{N}\sum_{a=1}^{M}y_{ia}\left[\tilde{\alpha}_1-\mathrm{tr}(\tilde{\Lambda}_0)-\mathrm{tr}(\tilde{\Lambda}_1)-(\tilde{\boldsymbol{u}}_i-\tilde{\boldsymbol{v}}_a)^{T}(\tilde{\boldsymbol{u}}_i-\tilde{\boldsymbol{v}}_a)\right]
\end{aligned}
$$
The last two expectations of the log functions can be simplified using Jensen's inequality. To evaluate the resulting terms, recall that $\boldsymbol{u}_i$, $\boldsymbol{u}_j$ are $D\times 1$ column vectors. Define $\mathbf{u}=\tilde{\boldsymbol{u}}_i-\tilde{\boldsymbol{u}}_j$. Then $\boldsymbol{u}_i-\boldsymbol{u}_j\sim N(\mathbf{u},\,2\tilde{\Lambda}_0)$, where $\mathbf{u}$ is a $D\times 1$ vector and $\tilde{\Lambda}_0$ is a $D\times D$ positive semidefinite matrix. Further define $\mathbf{Z}=(2\tilde{\Lambda}_0)^{-1/2}\big(\boldsymbol{u}_i-\boldsymbol{u}_j-(\tilde{\boldsymbol{u}}_i-\tilde{\boldsymbol{u}}_j)\big)$. Then $\mathbf{Z}$ clearly follows a $D$-dimensional standard multivariate normal distribution with density $f_Z(\mathbf{z})=\frac{1}{(2\pi)^{D/2}}\exp(-\frac{1}{2}\mathbf{z}^{T}\mathbf{z})$. Consequently, $\boldsymbol{u}_i-\boldsymbol{u}_j=(2\tilde{\Lambda}_0)^{1/2}\mathbf{Z}+\mathbf{u}$. Therefore, we can reparameterize

$$
E_{q(\boldsymbol{U},\boldsymbol{V}\mid\boldsymbol{Y}_I,\boldsymbol{Y}_{IA})}\Big[\exp\big(-(\boldsymbol{u}_i-\boldsymbol{u}_j)^{T}(\boldsymbol{u}_i-\boldsymbol{u}_j)\big)\Big]
=\int\frac{1}{(2\pi)^{D/2}}\exp\left(-\mathbf{Z}^{T}\Big(2\tilde{\Lambda}_0+\frac{1}{2}I\Big)\mathbf{Z}-2\mathbf{Z}^{T}(2\tilde{\Lambda}_0)^{1/2}\mathbf{u}-\mathbf{u}^{T}\mathbf{u}\right)d\mathbf{Z}.
$$

Now define $Q=\big(2\tilde{\Lambda}_0+\frac{1}{2}I\big)^{-1}(2\tilde{\Lambda}_0)^{1/2}\mathbf{u}$. Then the above integral becomes

$$
\begin{aligned}
&\int\frac{1}{(2\pi)^{D/2}}\exp\left(-(\mathbf{Z}-Q)^{T}\Big(2\tilde{\Lambda}_0+\frac{1}{2}I\Big)(\mathbf{Z}-Q)-\mathbf{u}^{T}\mathbf{u}+\mathbf{u}^{T}\Big(2\tilde{\Lambda}_0+\frac{1}{2}I\Big)^{-1}(2\tilde{\Lambda}_0)\,\mathbf{u}\right)d\mathbf{Z}\\
&=\exp\left(-\mathbf{u}^{T}\mathbf{u}+\mathbf{u}^{T}\Big(2\tilde{\Lambda}_0+\frac{1}{2}I\Big)^{-1}(2\tilde{\Lambda}_0)\,\mathbf{u}\right)\det(I+4\tilde{\Lambda}_0)^{-\frac{1}{2}}\\
&=\exp\left(-\mathbf{u}^{T}\Big(I-\big(2\tilde{\Lambda}_0+\tfrac{1}{2}I\big)^{-1}(2\tilde{\Lambda}_0)\Big)\mathbf{u}\right)\det(I+4\tilde{\Lambda}_0)^{-\frac{1}{2}}\\
&=\exp\left(-\mathbf{u}^{T}(I+4\tilde{\Lambda}_0)^{-1}\mathbf{u}\right)\det(I+4\tilde{\Lambda}_0)^{-\frac{1}{2}}.
\end{aligned}
$$

The last line follows since, for any two invertible matrices $A$ and $B$ such that $A+B$ is also invertible, $I-(A+B)^{-1}B=(A+B)^{-1}A$ (Henderson and Searle, 1981); taking $A=\frac{1}{2}I$ and $B=2\tilde{\Lambda}_0$ gives $(I+4\tilde{\Lambda}_0)^{-1}$.
Similarly, for the person-attribute terms, write $\mathbf{u}=\tilde{\boldsymbol{u}}_i-\tilde{\boldsymbol{v}}_a$ and $\boldsymbol{u}_i-\boldsymbol{v}_a=(\tilde{\Lambda}_0+\tilde{\Lambda}_1)^{1/2}\mathbf{Z}+\mathbf{u}$ with $\mathbf{Z}$ a $D$-dimensional standard normal vector. Then

$$
\begin{aligned}
E_{q(\boldsymbol{U},\boldsymbol{V}\mid\boldsymbol{Y}_I,\boldsymbol{Y}_{IA})}\Big[\exp\big(-(\boldsymbol{u}_i-\boldsymbol{v}_a)^{T}(\boldsymbol{u}_i-\boldsymbol{v}_a)\big)\Big]
&=E_{q(\boldsymbol{U},\boldsymbol{V}\mid\boldsymbol{Y}_I,\boldsymbol{Y}_{IA})}\left[\exp\Big(-\big(\mathbf{Z}^{T}(\tilde{\Lambda}_0+\tilde{\Lambda}_1)^{1/2}+\mathbf{u}^{T}\big)\big((\tilde{\Lambda}_0+\tilde{\Lambda}_1)^{1/2}\mathbf{Z}+\mathbf{u}\big)\Big)\right]\\
&=E_{q(\boldsymbol{U},\boldsymbol{V}\mid\boldsymbol{Y}_I,\boldsymbol{Y}_{IA})}\left[\exp\Big(-\mathbf{Z}^{T}(\tilde{\Lambda}_0+\tilde{\Lambda}_1)\mathbf{Z}-2\mathbf{Z}^{T}(\tilde{\Lambda}_0+\tilde{\Lambda}_1)^{1/2}\mathbf{u}-\mathbf{u}^{T}\mathbf{u}\Big)\right]\\
&=\int\frac{1}{(2\pi)^{D/2}}\exp\left(-\mathbf{Z}^{T}\Big(\tilde{\Lambda}_0+\tilde{\Lambda}_1+\frac{1}{2}I\Big)\mathbf{Z}-2\mathbf{Z}^{T}(\tilde{\Lambda}_0+\tilde{\Lambda}_1)^{1/2}\mathbf{u}-\mathbf{u}^{T}\mathbf{u}\right)d\mathbf{Z}.
\end{aligned}
$$

Now define $Q=\big(\tilde{\Lambda}_0+\tilde{\Lambda}_1+\frac{1}{2}I\big)^{-1}(\tilde{\Lambda}_0+\tilde{\Lambda}_1)^{1/2}\mathbf{u}$. Then the above integral becomes

$$
\begin{aligned}
&\int\frac{1}{(2\pi)^{D/2}}\exp\left(-(\mathbf{Z}-Q)^{T}\Big(\tilde{\Lambda}_0+\tilde{\Lambda}_1+\frac{1}{2}I\Big)(\mathbf{Z}-Q)-\mathbf{u}^{T}\mathbf{u}+\mathbf{u}^{T}\Big(\tilde{\Lambda}_0+\tilde{\Lambda}_1+\frac{1}{2}I\Big)^{-1}(\tilde{\Lambda}_0+\tilde{\Lambda}_1)\,\mathbf{u}\right)d\mathbf{Z}\\
&=\exp\left(-\mathbf{u}^{T}\mathbf{u}+\mathbf{u}^{T}\Big(\tilde{\Lambda}_0+\tilde{\Lambda}_1+\frac{1}{2}I\Big)^{-1}(\tilde{\Lambda}_0+\tilde{\Lambda}_1)\,\mathbf{u}\right)\det(I+2\tilde{\Lambda}_0+2\tilde{\Lambda}_1)^{-\frac{1}{2}}\\
&=\exp\left(-\mathbf{u}^{T}\Big(I-\big(\tilde{\Lambda}_0+\tilde{\Lambda}_1+\tfrac{1}{2}I\big)^{-1}(\tilde{\Lambda}_0+\tilde{\Lambda}_1)\Big)\mathbf{u}\right)\det(I+2\tilde{\Lambda}_0+2\tilde{\Lambda}_1)^{-\frac{1}{2}}\\
&=\exp\left(-\mathbf{u}^{T}(I+2\tilde{\Lambda}_0+2\tilde{\Lambda}_1)^{-1}\mathbf{u}\right)\det(I+2\tilde{\Lambda}_0+2\tilde{\Lambda}_1)^{-\frac{1}{2}}.
\end{aligned}
$$

The last line follows from the same identity of Henderson and Searle (1981), now with $A=\frac{1}{2}I$ and $B=\tilde{\Lambda}_0+\tilde{\Lambda}_1$, which gives $(I+2\tilde{\Lambda}_0+2\tilde{\Lambda}_1)^{-1}$.
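Both closed forms above are instances of the Gaussian identity $E[\exp(-X^{T}X)]=\det(I+2\Sigma)^{-1/2}\exp\big(-\mathbf{u}^{T}(I+2\Sigma)^{-1}\mathbf{u}\big)$ for $X\sim N(\mathbf{u},\Sigma)$, applied with $\Sigma=2\tilde{\Lambda}_0$ and $\Sigma=\tilde{\Lambda}_0+\tilde{\Lambda}_1$, respectively. The following Monte Carlo sketch (illustrative names only) checks the identity for a generic covariance.

import numpy as np

rng = np.random.default_rng(2)
D = 3
u = rng.normal(size=D)
A = rng.normal(size=(D, D))
Sigma = A @ A.T + 0.1 * np.eye(D)        # an arbitrary positive definite covariance

# Closed form: det(I + 2*Sigma)^(-1/2) * exp(-u' (I + 2*Sigma)^{-1} u)
closed_form = (np.linalg.det(np.eye(D) + 2 * Sigma) ** -0.5
               * np.exp(-u @ np.linalg.solve(np.eye(D) + 2 * Sigma, u)))

# Monte Carlo estimate of E[exp(-X'X)] with X ~ N(u, Sigma)
X = rng.multivariate_normal(u, Sigma, size=500_000)
mc = np.mean(np.exp(-np.sum(X ** 2, axis=1)))

print(closed_form, mc)   # the two values agree up to Monte Carlo error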
Finally, after applying Jensen's inequality to these last two expectations, the Kullback-Leibler divergence between the variational posterior and the true posterior contains, in addition to the terms derived above and an additive constant that does not depend on the variational parameters, the two sums

$$
\begin{aligned}
&\sum_{i=1}^{N}\sum_{a=1}^{M}\log\left(1+\frac{\exp(\tilde{\alpha}_1)}{\det(I+2\tilde{\Lambda}_0+2\tilde{\Lambda}_1)^{\frac{1}{2}}}\exp\Big(-(\tilde{\boldsymbol{u}}_i-\tilde{\boldsymbol{v}}_a)^{T}(I+2\tilde{\Lambda}_0+2\tilde{\Lambda}_1)^{-1}(\tilde{\boldsymbol{u}}_i-\tilde{\boldsymbol{v}}_a)\Big)\right)\\
&+\sum_{i=1}^{N}\sum_{j=1,j\neq i}^{N}\log\left(1+\frac{\exp(\tilde{\alpha}_0)}{\det(I+4\tilde{\Lambda}_0)^{\frac{1}{2}}}\exp\Big(-(\tilde{\boldsymbol{u}}_i-\tilde{\boldsymbol{u}}_j)^{T}(I+4\tilde{\Lambda}_0)^{-1}(\tilde{\boldsymbol{u}}_i-\tilde{\boldsymbol{u}}_j)\Big)\right)+\mathrm{Const},
\end{aligned}
$$

which we denote by

$$
\boldsymbol{F}_{ia}=\sum_{i=1}^{N}\sum_{a=1}^{M}\log\left(1+\frac{\exp(\tilde{\alpha}_1)}{\det(I+2\tilde{\Lambda}_0+2\tilde{\Lambda}_1)^{\frac{1}{2}}}\exp\Big(-(\tilde{\boldsymbol{u}}_i-\tilde{\boldsymbol{v}}_a)^{T}(I+2\tilde{\Lambda}_0+2\tilde{\Lambda}_1)^{-1}(\tilde{\boldsymbol{u}}_i-\tilde{\boldsymbol{v}}_a)\Big)\right) \quad (13)
$$

$$
\boldsymbol{F}_{i}=\sum_{i=1}^{N}\sum_{j=1,j\neq i}^{N}\log\left(1+\frac{\exp(\tilde{\alpha}_0)}{\det(I+4\tilde{\Lambda}_0)^{\frac{1}{2}}}\exp\Big(-(\tilde{\boldsymbol{u}}_i-\tilde{\boldsymbol{u}}_j)^{T}(I+4\tilde{\Lambda}_0)^{-1}(\tilde{\boldsymbol{u}}_i-\tilde{\boldsymbol{u}}_j)\Big)\right) \quad (14)
$$
The gradients of $\boldsymbol{F}_i$ and $\boldsymbol{F}_{ia}$ with respect to $\tilde{\boldsymbol{u}}_i$ are

$$
\boldsymbol{G}_i(\tilde{\boldsymbol{u}}_i)=-2(I+4\tilde{\Lambda}_0)^{-1}\sum_{j=1,j\neq i}^{N}(\tilde{\boldsymbol{u}}_i-\tilde{\boldsymbol{u}}_j)\left[1+\frac{\det(I+4\tilde{\Lambda}_0)^{\frac{1}{2}}}{\exp(\tilde{\alpha}_0)}\exp\Big((\tilde{\boldsymbol{u}}_i-\tilde{\boldsymbol{u}}_j)^{T}(I+4\tilde{\Lambda}_0)^{-1}(\tilde{\boldsymbol{u}}_i-\tilde{\boldsymbol{u}}_j)\Big)\right]^{-1}
$$

$$
\boldsymbol{G}_{ia}(\tilde{\boldsymbol{u}}_i)=-2(I+2\tilde{\Lambda}_0+2\tilde{\Lambda}_1)^{-1}\sum_{a=1}^{M}(\tilde{\boldsymbol{u}}_i-\tilde{\boldsymbol{v}}_a)\left[1+\frac{\det(I+2\tilde{\Lambda}_0+2\tilde{\Lambda}_1)^{\frac{1}{2}}}{\exp(\tilde{\alpha}_1)}\exp\Big((\tilde{\boldsymbol{u}}_i-\tilde{\boldsymbol{v}}_a)^{T}(I+2\tilde{\Lambda}_0+2\tilde{\Lambda}_1)^{-1}(\tilde{\boldsymbol{u}}_i-\tilde{\boldsymbol{v}}_a)\Big)\right]^{-1}
$$

The second-order partial derivatives (Hessian matrices) of $\boldsymbol{F}_i$, $\boldsymbol{F}_{ia}$ with respect to $\tilde{\boldsymbol{u}}_i$ are

$$
\begin{aligned}
\boldsymbol{H}_i(\tilde{\boldsymbol{u}}_i)={}&-2(I+4\tilde{\Lambda}_0)^{-1}\sum_{j=1,j\neq i}^{N}\left[1+\frac{\det(I+4\tilde{\Lambda}_0)^{\frac{1}{2}}}{\exp(\tilde{\alpha}_0)}\exp\Big((\tilde{\boldsymbol{u}}_i-\tilde{\boldsymbol{u}}_j)^{T}(I+4\tilde{\Lambda}_0)^{-1}(\tilde{\boldsymbol{u}}_i-\tilde{\boldsymbol{u}}_j)\Big)\right]^{-1}\\
&\left[I-\frac{2(\tilde{\boldsymbol{u}}_i-\tilde{\boldsymbol{u}}_j)(\tilde{\boldsymbol{u}}_i-\tilde{\boldsymbol{u}}_j)^{T}(I+4\tilde{\Lambda}_0)^{-1}}{1+\frac{\exp(\tilde{\alpha}_0)}{\det(I+4\tilde{\Lambda}_0)^{1/2}}\exp\Big(-(\tilde{\boldsymbol{u}}_i-\tilde{\boldsymbol{u}}_j)^{T}(I+4\tilde{\Lambda}_0)^{-1}(\tilde{\boldsymbol{u}}_i-\tilde{\boldsymbol{u}}_j)\Big)}\right]
\end{aligned}
$$

$$
\begin{aligned}
\boldsymbol{H}_{ia}(\tilde{\boldsymbol{u}}_i)={}&-2(I+2\tilde{\Lambda}_0+2\tilde{\Lambda}_1)^{-1}\sum_{a=1}^{M}\left[1+\frac{\det(I+2\tilde{\Lambda}_0+2\tilde{\Lambda}_1)^{\frac{1}{2}}}{\exp(\tilde{\alpha}_1)}\exp\Big((\tilde{\boldsymbol{u}}_i-\tilde{\boldsymbol{v}}_a)^{T}(I+2\tilde{\Lambda}_0+2\tilde{\Lambda}_1)^{-1}(\tilde{\boldsymbol{u}}_i-\tilde{\boldsymbol{v}}_a)\Big)\right]^{-1}\\
&\left[I-\frac{2(\tilde{\boldsymbol{u}}_i-\tilde{\boldsymbol{v}}_a)(\tilde{\boldsymbol{u}}_i-\tilde{\boldsymbol{v}}_a)^{T}(I+2\tilde{\Lambda}_0+2\tilde{\Lambda}_1)^{-1}}{1+\frac{\exp(\tilde{\alpha}_1)}{\det(I+2\tilde{\Lambda}_0+2\tilde{\Lambda}_1)^{1/2}}\exp\Big(-(\tilde{\boldsymbol{u}}_i-\tilde{\boldsymbol{v}}_a)^{T}(I+2\tilde{\Lambda}_0+2\tilde{\Lambda}_1)^{-1}(\tilde{\boldsymbol{u}}_i-\tilde{\boldsymbol{v}}_a)\Big)}\right]
\end{aligned}
$$
$$
\begin{aligned}
KL_{\tilde{\boldsymbol{u}}_i}={}&\tilde{\boldsymbol{u}}_i^{T}\left[\left(\frac{1}{2\lambda_0^2}+\sum_{a=1}^{M}y_{ia}+\sum_{j=1,j\neq i}^{N}(y_{ij}+y_{ji})\right)I+\frac{1}{2}\boldsymbol{H}_i(\tilde{\boldsymbol{u}}_i)+\frac{1}{2}\boldsymbol{H}_{ia}(\tilde{\boldsymbol{u}}_i)\right]\tilde{\boldsymbol{u}}_i\\
&-2\tilde{\boldsymbol{u}}_i^{T}\left[\sum_{a=1}^{M}y_{ia}\tilde{\boldsymbol{v}}_a+\sum_{j=1,j\neq i}^{N}(y_{ij}+y_{ji})\tilde{\boldsymbol{u}}_j-\frac{1}{2}\boldsymbol{G}_i(\tilde{\boldsymbol{u}}_i)-\frac{1}{2}\boldsymbol{G}_{ia}(\tilde{\boldsymbol{u}}_i)+\frac{1}{2}\big(\boldsymbol{H}_i(\tilde{\boldsymbol{u}}_i)+\boldsymbol{H}_{ia}(\tilde{\boldsymbol{u}}_i)\big)\tilde{\boldsymbol{u}}_i\right].
\end{aligned}
$$
With the Taylor expansions of the log functions, we can obtain the closed-form update rule for $\tilde{\boldsymbol{u}}_i$ by setting the partial derivative of the KL divergence equal to 0. Finally, we have

$$
\begin{aligned}
\tilde{\boldsymbol{u}}_i={}&\left[\left(\frac{1}{2\lambda_0^2}+\sum_{j=1,j\neq i}^{N}(y_{ij}+y_{ji})+\sum_{a=1}^{M}y_{ia}\right)\boldsymbol{I}+\frac{1}{2}\boldsymbol{H}_i(\tilde{\boldsymbol{u}}_i)+\frac{1}{2}\boldsymbol{H}_{ia}(\tilde{\boldsymbol{u}}_i)\right]^{-1}\\
&\left[\sum_{j=1,j\neq i}^{N}(y_{ij}+y_{ji})\tilde{\boldsymbol{u}}_j+\sum_{a=1}^{M}y_{ia}\tilde{\boldsymbol{v}}_a-\frac{1}{2}\boldsymbol{G}_i(\tilde{\boldsymbol{u}}_i)+\frac{1}{2}\big(\boldsymbol{H}_i(\tilde{\boldsymbol{u}}_i)+\boldsymbol{H}_{ia}(\tilde{\boldsymbol{u}}_i)\big)\tilde{\boldsymbol{u}}_i-\frac{1}{2}\boldsymbol{G}_{ia}(\tilde{\boldsymbol{u}}_i)\right]
\end{aligned}
$$
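Read as one pass of a coordinate update, the rule above is a single linear solve. Below is a minimal sketch assuming the gradient and Hessian terms $\boldsymbol{G}_i$, $\boldsymbol{G}_{ia}$, $\boldsymbol{H}_i$, $\boldsymbol{H}_{ia}$ have already been evaluated at the current variational parameters and are passed in; the function name and signature are illustrative, not the paper's implementation.

import numpy as np

def update_u_tilde_i(i, Y, Y_IA, u_tilde, v_tilde, lambda0_sq, G_i, G_ia, H_i, H_ia):
    # One closed-form update of u_tilde[i]; Y is the N x N network, Y_IA the N x M
    # person-attribute matrix, u_tilde (N x D) and v_tilde (M x D) the current means.
    N, D = u_tilde.shape
    deg = sum(Y[i, j] + Y[j, i] for j in range(N) if j != i) + Y_IA[i, :].sum()
    A = (1.0 / (2.0 * lambda0_sq) + deg) * np.eye(D) + 0.5 * H_i + 0.5 * H_ia
    b = (sum((Y[i, j] + Y[j, i]) * u_tilde[j] for j in range(N) if j != i)
         + Y_IA[i, :] @ v_tilde
         - 0.5 * G_i - 0.5 * G_ia
         + 0.5 * (H_i + H_ia) @ u_tilde[i])
    return np.linalg.solve(A, b)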
Similarly, we can obtain the closed-form update rule for $\tilde{\boldsymbol{v}}_a$ by taking the second-order Taylor expansion of $\boldsymbol{F}_{ia}$ (see Equation 13). The gradient and Hessian matrix of $\boldsymbol{F}_{ia}$ with respect to $\tilde{\boldsymbol{v}}_a$ are
$$
\boldsymbol{G}_{ia}(\tilde{\boldsymbol{v}}_a)=-2(I+2\tilde{\Lambda}_0+2\tilde{\Lambda}_1)^{-1}\sum_{i=1}^{N}(\tilde{\boldsymbol{v}}_a-\tilde{\boldsymbol{u}}_i)\left[1+\frac{\det(I+2\tilde{\Lambda}_0+2\tilde{\Lambda}_1)^{\frac{1}{2}}}{\exp(\tilde{\alpha}_1)}\exp\Big((\tilde{\boldsymbol{u}}_i-\tilde{\boldsymbol{v}}_a)^{T}(I+2\tilde{\Lambda}_0+2\tilde{\Lambda}_1)^{-1}(\tilde{\boldsymbol{u}}_i-\tilde{\boldsymbol{v}}_a)\Big)\right]^{-1}
$$

$$
\begin{aligned}
\boldsymbol{H}_{ia}(\tilde{\boldsymbol{v}}_a)={}&-2(I+2\tilde{\Lambda}_0+2\tilde{\Lambda}_1)^{-1}\sum_{i=1}^{N}\left[1+\frac{\det(I+2\tilde{\Lambda}_0+2\tilde{\Lambda}_1)^{\frac{1}{2}}}{\exp(\tilde{\alpha}_1)}\exp\Big((\tilde{\boldsymbol{u}}_i-\tilde{\boldsymbol{v}}_a)^{T}(I+2\tilde{\Lambda}_0+2\tilde{\Lambda}_1)^{-1}(\tilde{\boldsymbol{u}}_i-\tilde{\boldsymbol{v}}_a)\Big)\right]^{-1}\\
&\left[I-\frac{2(\tilde{\boldsymbol{v}}_a-\tilde{\boldsymbol{u}}_i)(\tilde{\boldsymbol{v}}_a-\tilde{\boldsymbol{u}}_i)^{T}(I+2\tilde{\Lambda}_0+2\tilde{\Lambda}_1)^{-1}}{1+\frac{\exp(\tilde{\alpha}_1)}{\det(I+2\tilde{\Lambda}_0+2\tilde{\Lambda}_1)^{1/2}}\exp\Big(-(\tilde{\boldsymbol{u}}_i-\tilde{\boldsymbol{v}}_a)^{T}(I+2\tilde{\Lambda}_0+2\tilde{\Lambda}_1)^{-1}(\tilde{\boldsymbol{u}}_i-\tilde{\boldsymbol{v}}_a)\Big)}\right]
\end{aligned}
$$
$$
KL_{\tilde{\boldsymbol{v}}_a}=\tilde{\boldsymbol{v}}_a^{T}\left[\left(\frac{1}{2\lambda_1^2}+\sum_{i=1}^{N}y_{ia}\right)I-\frac{1}{2}\boldsymbol{H}_{ia}(\tilde{\boldsymbol{v}}_a)\right]\tilde{\boldsymbol{v}}_a-2\tilde{\boldsymbol{v}}_a^{T}\left[\sum_{i=1}^{N}y_{ia}\tilde{\boldsymbol{u}}_i-\frac{1}{2}\boldsymbol{G}_{ia}(\tilde{\boldsymbol{v}}_a)\right].
$$
With the Taylor expansions of the log functions, we can obtain the closed-form update rule for $\tilde{\boldsymbol{v}}_a$ by setting the partial derivative of the KL divergence equal to 0. Then, we have

$$
\tilde{\boldsymbol{v}}_a=\left[\left(\frac{1}{2\lambda_1^2}+\sum_{i=1}^{N}y_{ia}\right)\boldsymbol{I}-\frac{1}{2}\boldsymbol{H}_{ia}(\tilde{\boldsymbol{v}}_a)\right]^{-1}\left[\sum_{i=1}^{N}y_{ia}\tilde{\boldsymbol{u}}_i-\frac{1}{2}\boldsymbol{G}_{ia}(\tilde{\boldsymbol{v}}_a)\right]
$$
To find the closed-form updates of $\tilde{\Lambda}_0$ and $\tilde{\Lambda}_1$, we use the first-order Taylor expansions of $\boldsymbol{F}_i$ and $\boldsymbol{F}_{ia}$. The gradients of $\boldsymbol{F}_i$ and $\boldsymbol{F}_{ia}$ with respect to $\tilde{\Lambda}_0$ are:
$$
\begin{aligned}
\boldsymbol{G}_i(\tilde{\Lambda}_0)={}&\sum_{i=1}^{N}\sum_{j=1,j\neq i}^{N}\left[1+\frac{\det(I+4\tilde{\Lambda}_0)^{\frac{1}{2}}}{\exp(\tilde{\alpha}_0)}\exp\Big((\tilde{\boldsymbol{u}}_i-\tilde{\boldsymbol{u}}_j)^{T}(I+4\tilde{\Lambda}_0)^{-1}(\tilde{\boldsymbol{u}}_i-\tilde{\boldsymbol{u}}_j)\Big)\right]^{-1}\\
&\left(4(I+4\tilde{\Lambda}_0)^{-1}(\tilde{\boldsymbol{u}}_i-\tilde{\boldsymbol{u}}_j)(\tilde{\boldsymbol{u}}_i-\tilde{\boldsymbol{u}}_j)^{T}(I+4\tilde{\Lambda}_0)^{-1}-\frac{1}{2}I\right)
\end{aligned}
$$

$$
\begin{aligned}
\boldsymbol{G}_{ia}(\tilde{\Lambda}_0)={}&\sum_{i=1}^{N}\sum_{a=1}^{M}\left[1+\frac{\det(I+2\tilde{\Lambda}_0+2\tilde{\Lambda}_1)^{\frac{1}{2}}}{\exp(\tilde{\alpha}_1)}\exp\Big((\tilde{\boldsymbol{u}}_i-\tilde{\boldsymbol{v}}_a)^{T}(I+2\tilde{\Lambda}_0+2\tilde{\Lambda}_1)^{-1}(\tilde{\boldsymbol{u}}_i-\tilde{\boldsymbol{v}}_a)\Big)\right]^{-1}\\
&\left(2(I+2\tilde{\Lambda}_0+2\tilde{\Lambda}_1)^{-1}(\tilde{\boldsymbol{u}}_i-\tilde{\boldsymbol{v}}_a)(\tilde{\boldsymbol{u}}_i-\tilde{\boldsymbol{v}}_a)^{T}(I+2\tilde{\Lambda}_0+2\tilde{\Lambda}_1)^{-1}-\frac{1}{2}I\right)
\end{aligned}
$$
The gradient of $\boldsymbol{F}_{ia}$ with respect to $\tilde{\Lambda}_1$ is:

$$
\begin{aligned}
\boldsymbol{G}_{ia}(\tilde{\Lambda}_1)={}&\sum_{i=1}^{N}\sum_{a=1}^{M}\left[1+\frac{\det(I+2\tilde{\Lambda}_0+2\tilde{\Lambda}_1)^{\frac{1}{2}}}{\exp(\tilde{\alpha}_1)}\exp\Big((\tilde{\boldsymbol{u}}_i-\tilde{\boldsymbol{v}}_a)^{T}(I+2\tilde{\Lambda}_0+2\tilde{\Lambda}_1)^{-1}(\tilde{\boldsymbol{u}}_i-\tilde{\boldsymbol{v}}_a)\Big)\right]^{-1}\\
&\left(2(I+2\tilde{\Lambda}_0+2\tilde{\Lambda}_1)^{-1}(\tilde{\boldsymbol{u}}_i-\tilde{\boldsymbol{v}}_a)(\tilde{\boldsymbol{u}}_i-\tilde{\boldsymbol{v}}_a)^{T}(I+2\tilde{\Lambda}_0+2\tilde{\Lambda}_1)^{-1}-\frac{1}{2}I\right)
\end{aligned}
$$
$$
KL_{\tilde{\Lambda}_0}=\mathrm{tr}(\tilde{\Lambda}_0)\left(\frac{N}{2\lambda_0^2}+\sum_{i=1}^{N}\sum_{a=1}^{M}y_{ia}+2\sum_{i=1}^{N}\sum_{j=1}^{N}y_{ij}\right)-\frac{N}{2}\log(\det(\tilde{\Lambda}_0))+\boldsymbol{G}_i(\tilde{\Lambda}_0)\tilde{\Lambda}_0+\boldsymbol{G}_{ia}(\tilde{\Lambda}_0)\tilde{\Lambda}_0
$$

$$
KL_{\tilde{\Lambda}_1}=\mathrm{tr}(\tilde{\Lambda}_1)\left(\frac{M}{2\lambda_1^2}+\sum_{i=1}^{N}\sum_{a=1}^{M}y_{ia}\right)-\frac{M}{2}\log(\det(\tilde{\Lambda}_1))+\boldsymbol{G}_{ia}(\tilde{\Lambda}_1)\tilde{\Lambda}_1
$$
With the Taylor expansions of the log functions, we can obtain the closed-form update rules for $\tilde{\Lambda}_0$ and $\tilde{\Lambda}_1$ by setting the corresponding partial derivatives of the KL divergence equal to 0.
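Under this first-order treatment, holding $\boldsymbol{G}_i(\tilde{\Lambda}_0)$, $\boldsymbol{G}_{ia}(\tilde{\Lambda}_0)$, and $\boldsymbol{G}_{ia}(\tilde{\Lambda}_1)$ fixed at their current values, one reading of the resulting stationarity conditions is

$$
\tilde{\Lambda}_0=\frac{N}{2}\left[\left(\frac{N}{2\lambda_0^2}+\sum_{i=1}^{N}\sum_{a=1}^{M}y_{ia}+2\sum_{i=1}^{N}\sum_{j=1}^{N}y_{ij}\right)I+\boldsymbol{G}_i(\tilde{\Lambda}_0)+\boldsymbol{G}_{ia}(\tilde{\Lambda}_0)\right]^{-1},\qquad
\tilde{\Lambda}_1=\frac{M}{2}\left[\left(\frac{M}{2\lambda_1^2}+\sum_{i=1}^{N}\sum_{a=1}^{M}y_{ia}\right)I+\boldsymbol{G}_{ia}(\tilde{\Lambda}_1)\right]^{-1},
$$

obtained by differentiating $KL_{\tilde{\Lambda}_0}$ and $KL_{\tilde{\Lambda}_1}$ with respect to $\tilde{\Lambda}_0$ and $\tilde{\Lambda}_1$ (a sketch of the implied update under the stated assumption).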
M-step: Estimate $\tilde{\alpha}_0$, $\tilde{\alpha}_1$, and $\tilde{\alpha}_2$ by minimizing the KL divergence. To find the closed-form updates of $\tilde{\alpha}_0$, $\tilde{\alpha}_1$, and $\tilde{\alpha}_2$, we use second-order Taylor expansions of the log functions and set the partial derivatives of the KL divergence with respect to $\tilde{\alpha}_0$, $\tilde{\alpha}_1$, and $\tilde{\alpha}_2$ to zero. Then we have
$$
\tilde{\alpha}_0=\frac{\sum_{i=1}^{N}\sum_{j=1,j\neq i}^{N}y_{ij}-g_i(\tilde{\alpha}_0)+\tilde{\alpha}_0\,h_i(\tilde{\alpha}_0)}{h_i(\tilde{\alpha}_0)}
$$

$$
\tilde{\alpha}_1=\frac{\sum_{i=1}^{N}\sum_{a=1}^{M}y_{ia}-g_{ia}(\tilde{\alpha}_1)+\tilde{\alpha}_1\,h_{ia}(\tilde{\alpha}_1)}{h_{ia}(\tilde{\alpha}_1)}
$$
where

$$
g_i(\tilde{\alpha}_0)=\sum_{i=1}^{N}\sum_{j=1,j\neq i}^{N}\left[1+\frac{\det(I+4\tilde{\Lambda}_0)^{\frac{1}{2}}}{\exp(\tilde{\alpha}_0)}\exp\Big((\tilde{\boldsymbol{u}}_i-\tilde{\boldsymbol{u}}_j)^{T}(I+4\tilde{\Lambda}_0)^{-1}(\tilde{\boldsymbol{u}}_i-\tilde{\boldsymbol{u}}_j)\Big)\right]^{-1}
$$

$$
\begin{aligned}
h_i(\tilde{\alpha}_0)={}&\sum_{i=1}^{N}\sum_{j=1,j\neq i}^{N}\left[1+\frac{\det(I+4\tilde{\Lambda}_0)^{\frac{1}{2}}}{\exp(\tilde{\alpha}_0)}\exp\Big((\tilde{\boldsymbol{u}}_i-\tilde{\boldsymbol{u}}_j)^{T}(I+4\tilde{\Lambda}_0)^{-1}(\tilde{\boldsymbol{u}}_i-\tilde{\boldsymbol{u}}_j)\Big)\right]^{-1}\\
&\left[1+\frac{\exp(\tilde{\alpha}_0)}{\det(I+4\tilde{\Lambda}_0)^{\frac{1}{2}}}\exp\Big(-(\tilde{\boldsymbol{u}}_i-\tilde{\boldsymbol{u}}_j)^{T}(I+4\tilde{\Lambda}_0)^{-1}(\tilde{\boldsymbol{u}}_i-\tilde{\boldsymbol{u}}_j)\Big)\right]^{-1}
\end{aligned}
$$

$$
g_{ia}(\tilde{\alpha}_1)=\sum_{i=1}^{N}\sum_{a=1}^{M}\left[1+\frac{\det(I+2\tilde{\Lambda}_0+2\tilde{\Lambda}_1)^{\frac{1}{2}}}{\exp(\tilde{\alpha}_1)}\exp\Big((\tilde{\boldsymbol{u}}_i-\tilde{\boldsymbol{v}}_a)^{T}(I+2\tilde{\Lambda}_0+2\tilde{\Lambda}_1)^{-1}(\tilde{\boldsymbol{u}}_i-\tilde{\boldsymbol{v}}_a)\Big)\right]^{-1}
$$

$$
\begin{aligned}
h_{ia}(\tilde{\alpha}_1)={}&\sum_{i=1}^{N}\sum_{a=1}^{M}\left[1+\frac{\det(I+2\tilde{\Lambda}_0+2\tilde{\Lambda}_1)^{\frac{1}{2}}}{\exp(\tilde{\alpha}_1)}\exp\Big((\tilde{\boldsymbol{u}}_i-\tilde{\boldsymbol{v}}_a)^{T}(I+2\tilde{\Lambda}_0+2\tilde{\Lambda}_1)^{-1}(\tilde{\boldsymbol{u}}_i-\tilde{\boldsymbol{v}}_a)\Big)\right]^{-1}\\
&\left[1+\frac{\exp(\tilde{\alpha}_1)}{\det(I+2\tilde{\Lambda}_0+2\tilde{\Lambda}_1)^{\frac{1}{2}}}\exp\Big(-(\tilde{\boldsymbol{u}}_i-\tilde{\boldsymbol{v}}_a)^{T}(I+2\tilde{\Lambda}_0+2\tilde{\Lambda}_1)^{-1}(\tilde{\boldsymbol{u}}_i-\tilde{\boldsymbol{v}}_a)\Big)\right]^{-1}
\end{aligned}
$$
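In code, the update for $\tilde{\alpha}_0$ amounts to the Newton-type step $\tilde{\alpha}_0\leftarrow\tilde{\alpha}_0+\big(\sum_{i\neq j}y_{ij}-g_i(\tilde{\alpha}_0)\big)/h_i(\tilde{\alpha}_0)$, and the $\tilde{\alpha}_1$ update is analogous with $g_{ia}$, $h_{ia}$, and $\sum_{i,a}y_{ia}$. A minimal Python sketch of the network-intercept update, with illustrative names and a direct double loop over the weights defined above (not the paper's implementation):

import numpy as np

def update_alpha0(alpha0, u_tilde, Lambda0, Y):
    # One M-step update of the network intercept alpha0_tilde.
    N, D = u_tilde.shape
    A = np.eye(D) + 4.0 * Lambda0
    Ainv = np.linalg.inv(A)
    det_sqrt = np.sqrt(np.linalg.det(A))
    g0, h0 = 0.0, 0.0
    for i in range(N):
        for j in range(N):
            if j == i:
                continue
            d = u_tilde[i] - u_tilde[j]
            q = d @ Ainv @ d
            w = 1.0 / (1.0 + det_sqrt / np.exp(alpha0) * np.exp(q))     # term of g_i(alpha0)
            w_c = 1.0 / (1.0 + np.exp(alpha0) / det_sqrt * np.exp(-q))  # complementary factor
            g0 += w
            h0 += w * w_c                                               # term of h_i(alpha0)
    y_sum = Y.sum() - np.trace(Y)        # sum of y_ij over i != j
    return (y_sum - g0 + alpha0 * h0) / h0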