
Chapter 2

Multi-View Clustering

Deepak P and Anna Jurek-Loughrey

Abstract With a plethora of data capturing modalities becoming available, the same data object often leaves different kinds of digital footprints. This naturally leads to datasets comprising the same set of data objects represented in different forms, called multi-view data. Among the most fundamental tasks in unsupervised learning is that of clustering, the task of grouping data objects into groups of related objects. Multi-view clustering (MVC) is a flourishing field in unsupervised learning; the MVC task considers leveraging multiple views of data objects in order to arrive at a more effective and accurate grouping than what can be achieved by just using one view of data. Multi-view clustering methods differ in the kind of modelling they use in order to fuse multiple views, by managing the synergies, complementarities, and conflicts across data views, and arriving at a single clustering output across the multiple views in the dataset. This chapter provides a survey of a sample of multi-view clustering methods, with an emphasis on bringing out the wide diversity in solution formulations that have been considered. We pay specific attention to enabling the reader to understand the intuition behind each method ahead of describing its technical details, to ensure that the survey is accessible to readers who may not be machine learning specialists. We also outline some popular datasets that have been used to empirically evaluate MVC methods.

2.1 Introduction

Exploratory data analysis is becoming increasingly important, with massive
amounts of data being created every moment, vastly outpacing any chance of
processing them manually. Modern data scenarios routinely embrace complex
objects, whose representations encompass multiple forms—often called views—
possibly containing even different types of data, such as text, images, sets, and

Deepak P · A. Jurek-Loughrey
Queen's University Belfast, Belfast, UK
e-mail: [email protected]; [email protected]

© Springer Nature Switzerland AG 2019
Deepak P, A. Jurek-Loughrey (eds.), Linking and Mining Heterogeneous
and Multi-view Data, Unsupervised and Semi-Supervised Learning,
https://doi.org/10.1007/978-3-030-01872-6_2

sequences. As a simple example, our social media streams are often occupied
by a mix of text (which itself appears in multiple forms), images, and videos.
Archeological artifacts are often represented electronically using their geo-location,
their 3D view, and the properties of their ingredients. The abundance of data
is most pronounced in fields that involve constant monitoring, such as astronomy,
where time series data of incoming radiation is incessantly captured along with
other sensing modalities. The emergence of persuasive health technologies such
as activity trackers that hold a variety of sensors has seen massive amounts of
multi-sensor data captured at the individual level.
A dataset is said to be multi-view if it comprises multiple data representations—
called views—and is said to be parallel if objects in the dataset are represented
across the multiple views. It may be noted that some objects may not have a
representation in certain views. Social media posts comprising [userid, text, image,
geolocation] tuples are thus a parallel 4-view dataset, since each social media
post would be associated with a userid, some text, (optionally) an image, and
a geo-location (if enabled). Multi-view parallel data is increasingly becoming
ubiquitous with information being captured using a variety of different modalities,
and their prevalence cannot be overemphasized. To outline two exemplary domains,
observe that healthcare systems often capture the same disease condition using
different medical sensors (e.g., EEG, fMRI, and PET are different ways of capturing
neurological information), and criminal records often represent the same crime
using modalities such as textual narratives, CCTV footage, audio tapes, and
photographs. Since our focus in this paper is multi-view parallel data, we will simply
refer to it as multi-view data in the remainder of the chapter.
With the increasing availability of multi-view data across a variety of scenarios,
and manual labelling being expensive and impractical in many big data scenarios,
multi-view unsupervised learning, the discipline that addresses classical unsuper-
vised learning tasks—viz., clustering [22], dimensionality reduction (e.g., [37]), and
outlier detection [18]—over multi-view data, has witnessed massive attention from
the scholarly community. In this paper, we provide an overview of the major lines
of research in clustering for multi-view data, often referred to simply as multi-view
clustering (MVC).
Clustering Clustering is a fundamental task in machine learning, and focuses on
grouping objects in a dataset into multiple groups, called clusters. The typical
criterion for grouping is that objects that are put into the same group should be
more similar to each other than objects that are put into different groups. A classical
algorithm for clustering [33], the framework of which still forms the backbone of
many modern clustering algorithms, dates back to the 1960s. With the decision
about the cluster membership of each data object being dependent on the clustering
assignments for the other objects in the dataset, the task of clustering often boils
down to optimizing for a dataset-wide objective function. A core building block
leveraged by clustering algorithms is the choice of the measure of similarity between
objects in the dataset being subject to clustering. With different domains and data
types from them often requiring similarity measures that are tailored to their needs,

a variety of clustering algorithms have emerged over the last many decades, many
of which have become the most highly cited papers in their respective realms; these
include clustering algorithms for gene data [15] and time series [29]. The similarity
between objects is typically computed as an aggregate of the pairwise similarities of
their attributes. In the case of multi-view data, with each data object having multiple
views, and each view having a different set of features, there is another level of
hierarchy that multi-view clustering techniques need to be cognizant of, to exploit.
Overview of the Paper This paper presents an overview of the state-of-the-art in
multi-view clustering, with a focus on covering the different families of methods that
have been proposed for the task. Our intent, in slight contrast to a regular survey, is
to provide a high-level picture of multi-view clustering methods which is accessible
to a generalist who may not be necessarily familiar with the various families of
mathematical building blocks that are employed within each method. In addition,
we pay particular attention to providing information to enable the reader to appreciate
the unique characteristics of each family or method, which may make it the preferred
method for a particular niche scenario in multi-view clustering. With the focus being
on formulations, we omit details about empirical evaluation of the methods, details
of which may be obtained from the individual papers. In addition to researchers, we
expect our paper to be useful to practitioners who may be looking to decide on a
particular multi-view clustering method for usage within their data setting.
Outline We start with outlining the task of multi-view clustering and introducing
necessary notation that would be used in the subsequent narrative. Next, we provide
a broad outline of the different families of methods that we will cover in our survey,
followed by a section each for each family of methods. This will be followed by a
section describing a few relaxations to the multi-view clustering problem that have
been explored in the literature. We then list a set of datasets that have been explored
for evaluating MVC methods and then conclude the chapter.

2.2 Multi-View Clustering: The Task

The input to the clustering task is a multi-view dataset, which we will represent
as X = {. . . , x, . . .}. We use V = {. . . , v, . . .} to represent the set of views in
the dataset. Each of these views may comprise multiple attributes, and each multi-
view object x takes a value for each attribute within each view, with xv .a denoting
the value it takes for attribute a within view v. In many clustering formulations,
the number of desired output clusters is also an input parameter, which we will
refer to as k. As alluded to earlier, clustering algorithms typically make use of
a quantification of similarity between values for each attribute, which may be
rolled up to the object–pair level, both of which are denoted by s(., .); which of the
two is meant is easily identifiable from the context. Some clustering formulations
use a distance function instead of a similarity function, in which case the distance
function is referred to as d(., .).

Table 2.1 Notations

  X        A multi-view dataset
  x        A multi-view object within X
  n        Number of objects in X
  V        A set of views represented in a multi-view dataset
  v        A view within V
  m        Number of views in V
  Xv       The subset of X corresponding to view v
  xv       The view v representation of object x
  Av       The attributes or features within view v
  a        An attribute within a view
  xv.a     The value taken by object x for attribute a within view v
  d(., .)  A pre-specified distance function that quantifies the distance between its
           two input values; we overload this notation to denote both the distance
           between two values and its aggregation to the object level
  s(., .)  A pre-specified similarity function analogous to d(., .)
  C        The set of clusters in the generated clustering
  k        The number of clusters in C
  C        A cluster within C
  x.C      The cluster assigned to the object x
  C.p      The centroid or prototype object for cluster C

Most formulations of MVC output a crisp grouping of all objects to clusters;
typical clustering outputs are a partition of the dataset, in the sense that each object x is
necessarily assigned to a unique cluster, denoted by x.C. We use C = {. . . , C, . . .} to
denote the clustering output by the MVC method. These notations are summarized
in Table 2.1.
Thus, at a task level, the MVC task may be seen as using a dataset to arrive at a
clustering:

\[
X \rightarrow C \tag{2.1}
\]

This may alternatively be written as:

\[
\{\ldots, x, \ldots\} \rightarrow \{\ldots, x.C, \ldots\} \tag{2.2}
\]

While most MVC methods agree to this general framework and may be described
using the above notation which are also tabulated in Table 2.1, there are clustering
algorithms that require more terminology to describe. We will introduce such

specific terminology as and when we describe those methods. Additionally, it may
be noted that some single-view clustering formulations that may leave out some
dataset objects from the clustering output (e.g., [1]) have not been explored for
devising MVC methods.

2.3 Overview of Multi-View Clustering Methods

We now organize MVC methods into groups based on the technical character of the
algorithms. These groupings, while broadly based on the framework each method uses,
are not fully objective. Some families may legitimately be seen as overlapping;
however, we believe that our groupings will help provide a framework towards better
understanding the variety and diversity of methods used for MVC. Our groups are
listed below, with each group being described in detail in a separate subsequent
section.
– K-Means-Based Approaches: K-Means, a classical algorithm for cluster-
ing [33] single-view data, still holds much sway in the clustering community
after half-a-century [21]. Thus, the largest set of MVC methods build upon the
K-Means framework.
– Matrix Factorization: The dataset corresponding to each view may easily be
represented as a matrix with data objects corresponding to rows, and attributes
corresponding to columns. This representation easily yields to matrix factoriza-
tion approaches, particularly those from nonnegative matrix factorization [28].
There have been various flavors of the MVC task that have been addressed using
matrix factorization methods.
– Topic Modelling-based Approaches: Topic models [3] seek to model docu-
ments as a mixture of topics, with each word being drawn from a topic that has
representation in the document. There have been several methods that draw upon
the idea of topic modelling, of which probabilistic latent semantic analysis [19]
has seen much uptake towards crafting methods for MVC.
– Spectral Methods: In a broad sense, spectral methods make usage of the
spectrum, i.e., the set of eigenvalues, and of the similarity matrix of the data
to perform clustering of the dataset. These, at a fundamental level, relate to
graph representations that model the similarities between data points. Spectral
clustering methods, such as [38], have seen much interest in the image processing
community, and have been adapted to the general MVC task as well.
– Techniques using Exemplars: Exemplars are typically used to refer to a proxy
object, whether it be for a cluster or an individual data object. These lend well to
belief and affinity propagation models (e.g., [17]), which have also inspired the
design of some MVC methods.
– Miscellaneous: In this family, we cover MVC methods that do not necessarily
fit well within any of the above classifications. These include techniques that are
inspired by canonical correlation analysis [40] and co-clustering [12].

The expert would rightly observe that there is a nontrivial overlap between
families; for example, K-Means clustering may be seen as an instance of matrix
factorization, and so could be topic modelling. Thus, our categorization is intended
to give an organization that a researcher would easily relate to, and does not imply
that the separate families are disjoint or unrelated.

2.4 K-Means Variants for MVC

We first start with a description of the K-Means clustering method [33], a popular
clustering algorithm for single-view data. K-Means targets to produce a pre-
specified number of clusters, denoted as k, in the output. Each cluster is represented
by a prototype, a virtual data object that is modelled as the mean of all the data
objects assigned to the cluster. Simplistically, K-Means creates a cluster assignment
towards optimizing the following objective function:

\[
\sum_{x \in X} d(x,\, x.C.p) \tag{2.3}
\]

Thus, the clustering allocation is made in a way that the sum of distances of
each object x to the prototype of cluster to which it is assigned, denoted by x.C.p,
is minimized. The distance function is typically modelled as the L2 norm1 of the
vector of distances between the objects over the set of attributes under consideration.

\[
d(x, y) = \sqrt{\sum_{a \in A} (x.a - y.a)^2} \tag{2.4}
\]

In the single-view formulation, there is only one set of attributes, A, given that we
only have a single-view representation for each object. The K-Means formulation
may be thought of as an instance of the Expectation Maximization algorithm [11],
where two sets of parameters, the cluster assignments {. . . , x.C, . . .}, and the
cluster centroids {. . . , C.p, . . .} are optimized in an alternating fashion iteratively
until convergence. As may be obvious for a reader familiar with EM, the cluster
assignment corresponds to the E-step and the centroid learning corresponds to the
M-step. Since K-Means could converge at local minima, the initialization of clusters
in order to kick-start the iterative learning process is often regarded to be critical.

1. http://mathworld.wolfram.com/L2-Norm.html
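To make the alternating structure concrete, the following is a minimal K-Means sketch in Python/NumPy. It is an illustrative rendering of the generic algorithm described above, not code from any of the surveyed papers; all names are ours.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal K-Means: alternate cluster assignment (E-step) and centroid
    re-estimation (M-step) until the assignment stops changing."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Initialize prototypes by picking k distinct data objects at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    assignment = np.full(len(X), -1)
    for _ in range(n_iters):
        # E-step: assign each object to its nearest prototype (L2 distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_assignment = dists.argmin(axis=1)
        if np.array_equal(new_assignment, assignment):
            break
        assignment = new_assignment
        # M-step: recompute each prototype as the mean of its members.
        for j in range(k):
            if np.any(assignment == j):
                centroids[j] = X[assignment == j].mean(axis=0)
    return assignment, centroids
```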

2.4.1 Alternating K-Means for Two-View Data

In probably what could be among the earliest works in MVC [2], extensions to the
K-Means method for multi-view data were proposed. This was specifically tailored
to two-view data, with a focus on document and webpage clustering. In what could
be regarded as a simplistic extension to K-Means, they propose to interleave the EM
steps corresponding to each view. Starting from an initialized cluster membership, a
sequence of M and E steps are performed by using just the data from one view (to re-
emphasize, the data from the other view is not used) to arrive at a cluster allocation
for the data objects. The cluster allocation is taken to then perform the M and E
steps using just the data from the other view. In summary, each iteration involving
the sequence of M and E steps uses one particular view, with the immediate next
iteration shifting the focus to the other view. Thus, the clustering information across
views flows across iterations through the clustering allocations. At the end of a
sequence of such view-alternating iterations, two sets of clustering allocations are
produced, each one corresponding to the latest allocation from each view. These
clustering allocations are then merged in a post-processing step in order to arrive at
a single clustering for the data objects.
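A rough sketch of this interleaving is given below, under the simplifying assumptions that both views are numeric matrices over the same objects and that the final merging step of [2] is omitted; the names are illustrative rather than taken from that work.

```python
import numpy as np

def alternating_two_view_kmeans(X1, X2, k, n_rounds=10, seed=0):
    """Interleave K-Means M/E steps across two views: each round runs the
    M-step and E-step on one view only, starting from the cluster
    allocation produced by the other view in the previous round."""
    rng = np.random.default_rng(seed)
    n = len(X1)
    assignment = rng.integers(0, k, size=n)     # initial cluster membership
    views = [np.asarray(X1, dtype=float), np.asarray(X2, dtype=float)]
    for r in range(n_rounds):
        Xv = views[r % 2]                       # alternate the active view
        # M-step: centroids in the active view from the current allocation.
        centroids = np.stack([
            Xv[assignment == j].mean(axis=0) if np.any(assignment == j)
            else Xv[rng.integers(0, n)]         # re-seed an empty cluster
            for j in range(k)
        ])
        # E-step: reassign objects using distances in the active view only.
        dists = np.linalg.norm(Xv[:, None, :] - centroids[None, :, :], axis=2)
        assignment = dists.argmin(axis=1)
    return assignment
```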
At the task level, it is notable that the method is proposed with two-view data in
mind; however, a simple extension that executes iterations in round-robin fashion
by cycling through three or more views may be envisaged for multi-view data
incorporating three or more views (the empirical performance would need to be
investigated). Another notable feature of this method is that there is no provision
for weighting the two views differently with regard to their influence in the final
clustering output. Such differential weighting, we will see, has been the focus of
many later K-Means variants.

2.4.2 Max/Min Fusion Within K-Means

A very trivial extension to the K-Means objective function can be arrived at by just
summing up the distances across views, and optimizing for the sum of distances:

\[
\sum_{x \in X} \sum_{v \in V} d\big(x_v,\, (x.C.p)_v\big) \tag{2.5}
\]

where (x.C.p)v denotes the view v representation of the cluster centroid to which x
belongs. The view-specific distance function is simply the L2 distance over the
attributes in that view.

\[
d(x_v, y_v) = \sqrt{\sum_{a \in A_v} (x_v.a - y_v.a)^2} \tag{2.6}
\]

It may be noted that the sum of distances is clearly equivalent to the average
of distances across views, with the number of views being the same across
all data objects. Motivated by scenarios from community question answering
systems, where the multi-view dataset comprises two-view objects with question
(Q) and answer (A) text forming separate views, Deepak [9] proposes replacing the
average/sum aggregation of distances across views by the min function. Thus, the
membership of an object in a cluster is determined by computing the distance of the
object to each cluster prototype aggregated across views, the aggregation performed
by using the min function. Thus, the objective function changes to:

\[
\sum_{x \in X} \min\big\{ d\big(x_v,\, (x.C.p)_v\big) \;\big|\; v \in V \big\} \tag{2.7}
\]

In summary, if x is very proximal to a cluster prototype in one of the views, it
will be assigned to that cluster regardless of the distance of x to the same cluster's
prototype in other views. This is motivated by scenarios where the similarity
information is localized in certain views; for example, this formulation places QA
pairs that are highly similar on either the Q or A views, within the same cluster.
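The min-aggregated assignment step can be sketched as follows; this is a simplified illustration of the cluster-assignment rule only, using the exact min rather than the differentiable approximation discussed next, and the names are ours.

```python
import numpy as np

def assign_min_fusion(views, prototypes):
    """Assign each object to the cluster whose prototype is closest in *any*
    view, i.e., aggregate the view-specific distances with min.

    views      : list of (n x d_v) arrays, one per view
    prototypes : list of (k x d_v) arrays, one per view
    """
    n, k = len(views[0]), len(prototypes[0])
    # dists[x, c] = min over views of the view-specific L2 distance.
    dists = np.full((n, k), np.inf)
    for Xv, Pv in zip(views, prototypes):
        dv = np.linalg.norm(Xv[:, None, :] - Pv[None, :, :], axis=2)
        dists = np.minimum(dists, dv)
    return dists.argmin(axis=1)
```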
With the min function being non-differentiable, Deepak [9] proposes usage
of a differentiable approximation involving exponentiation. The approximation
additionally is applicable to using max aggregation instead of min, though the focus
of their work is min aggregation. The paper observes that this formulation, much
like the previous one, is trivially extensible to more than two views. In another point
of similarity with the earlier work, there is no intrinsic method to pre-specify that
one of the views should influence the clustering decisions more. However, the paper
notes that scaling d(xv , yv ) by a view-specific weight would allow the user to make
such tuning; in such a case, a lower weight would lead to the view being able to
influence the clustering more, the aggregation function being min.

2.4.3 View-Weighted K-Means with L2,1 Norm

Let us now look at the relation between K-Means and matrix factorization. Single-
view (relaxed [13]) K-Means clustering may be written as a nonnegative matrix
factorization problem with the objective:

\[
\min_{G, F} \; ||X - GF||_F^2 \qquad \text{s.t. } G_{ij} \in \{0, 1\},\; \sum_{j=1}^{k} G_{ij} = 1 \tag{2.8}
\]

where X is the input single-view data matrix (given that single-view is a specializa-
tion of multi-view with the number of views being unity, we use the same variable
X) with n rows and as many columns as there are attributes in the view, G being
a clustering indicator matrix of n × k, and F being a cluster centroid matrix with
one row per cluster. It may be noted that the constraints placed on G enforce that
each object is assigned to only one of the k clusters, with k being a user-specified
parameter. The solutions, G (clustering assignment) and F (cluster centroids) are
arrived at by minimizing the square of the Frobenius norm2 of the difference matrix
(indicated by ||M||2F ), which is essentially the sum of the squares of all elements in
the difference between X and GF matrices.
In extending this to multi-view data, we would need to account for m matrices
of data, one corresponding to each data view, as well as m cluster centroid matrices
(again, one for each view). However, given the MVC task, there needs to be a single
cluster indicator matrix across all views. This would yield

\[
\min_{G, \{\ldots, F_v, \ldots\}} \; \sum_{v \in V} ||X_v - G F_v||_F^2 \tag{2.9}
\]

with the usual constraints on the G matrix being applied. In contrast to such an
extension, Cai et al. [7] allow for view weights and use the L2,1 norm instead of the
Frobenius norm, leading to the following:

\[
\min_{G, \{\ldots, F_v, \ldots\}} \; \sum_{v \in V} (w_v)^{\lambda}\, ||X_v - G F_v||_{2,1} \tag{2.10}
\]

For a matrix M, the L2,1 norm first takes the L2 norm of each vector along one
direction and then sums (L1) these norms along the other. In the above case, it amounts to
the L1 norm in the data-points direction, and the L2 norm in the features direction. The L2,1 norm
is popular in scenarios where robustness is desired (e.g., [36]), a feature that the
authors of [7] argue as being important in MVC. The wv are view weights that are
learnt within the optimization process, with λ being a parameter that would control
the weights distribution. The modified objective leads to different update rules that
are detailed in [7]. Unlike the earlier two papers, this method quantifies the influence
each view could have in the clustering process; however, the views are learnt in the
process of optimization (not pre-specified by the user).
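For concreteness, the L2,1 norm used in place of the Frobenius norm above can be computed as follows; this small helper assumes data points lie along the columns of the residual matrix, so the L2 norm is taken per column and the results are summed.

```python
import numpy as np

def l21_norm(M):
    """L2,1 norm: L2 norm within each column, then a sum (L1) across columns."""
    return np.sum(np.linalg.norm(M, axis=0))

def frobenius_norm(M):
    """Frobenius norm, for comparison: L2 norm over all entries."""
    return np.linalg.norm(M)

M = np.array([[3.0, 0.0], [4.0, 1.0]])
print(l21_norm(M))        # |(3,4)|_2 + |(0,1)|_2 = 5 + 1 = 6
print(frobenius_norm(M))  # sqrt(9 + 16 + 0 + 1) = sqrt(26)
```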

2.4.4 View and Attribute Weighting Within K-Means

After work on view-specific weighting, almost as a natural next step, algorithms for
MVC were proposed that use attribute-weighting within views. Jiang et al. [24]

2. http://mathworld.wolfram.com/FrobeniusNorm.html

propose extending the basic K-Means model along that direction, leading to an
objective function:
\[
\sum_{x \in X} \sum_{v \in V} \sum_{a \in A_v} (w_v)^{\alpha} (z_a)^{\beta} \big( x_v.a - (x.C.p)_v.a \big)^2 \tag{2.11}
\]
\[
\sum_{v \in V} w_v = 1, \qquad \forall\, v \in V: \; \sum_{a \in A_v} z_a = 1
\]

Thus, the distance between the data point and the cluster prototype along each
attribute within each view is scaled twice, first by the view-specific weight for
the view, and second by the attribute-specific weight. Additionally, the weight
distributions are controlled by respective exponents α and β. Further, as indicated
above, it is enforced that the view weights across views sum to unity, as well as that
the attribute weights across attributes within each view sum to unity. The learning
process sequentially learns: (1) cluster centroids, (2) attribute weights, (3) view
weights, and (4) cluster assignments, in four different steps within each iteration.
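The doubly weighted objective of Eq. (2.11) can be evaluated directly; the sketch below is purely illustrative, with view weights w, attribute weights z and the exponents alpha and beta supplied by the caller.

```python
import numpy as np

def weighted_mvkm_objective(views, prototypes, assignment, w, z, alpha, beta):
    """Objective of Eq. (2.11): squared attribute-wise distances scaled by
    (view weight)^alpha and (attribute weight)^beta, summed over the dataset.

    views      : list of (n x d_v) arrays          w : list of view weights
    prototypes : list of (k x d_v) arrays          z : list of (d_v,) weight arrays
    assignment : (n,) array of cluster indices
    """
    total = 0.0
    for Xv, Pv, wv, zv in zip(views, prototypes, w, z):
        diff_sq = (Xv - Pv[assignment]) ** 2            # (n x d_v) residuals
        total += (wv ** alpha) * np.sum((zv ** beta) * diff_sq)
    return total
```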
Along a similar direction, Chen et al. [8] propose the usage of additional terms
quantifying the negative entropies of the view and feature weights, so most views
and features are called into play, unless there is a compelling reason to focus on just
a few. Their objective function assumes the following form:
\[
\sum_{x \in X} \sum_{v \in V} \sum_{a \in A_v} w_v\, z_a \big( x_v.a - (x.C.p)_v.a \big)^2 + \eta \sum_{a \in \cup_{v \in V} A_v} z_a \log(z_a) + \lambda \sum_{v \in V} w_v \log(w_v) \tag{2.12}
\]

along with the sum-to-unity constraints on z and w as earlier. Including the negative
entropies in minimization is motivated by the previous work on similar lines [25].
Similar to what is done in [24], the four sets of parameters are sequentially
optimized for within each iteration.
In yet another variation, Xu et al. [47] propose the usage of a regularizer to
control the sparsity over the feature weights, leading to the following objective:
\[
\sum_{x \in X} \sum_{v \in V} \sum_{a \in A_v} (w_v)^{\alpha} z_a \big( x_v.a - (x.C.p)_v.a \big)^2 + \beta \sum_{v \in V} \big\| \{ z_a \mid a \in A_v \} \big\|^2 \tag{2.13}
\]

along with the sum-to-unity constraints as earlier. The regularizer avoids attaining
a configuration where only a few features are selected, which would otherwise yield
a small but meaningless objective value.

All the methods covered in this subsection, as in the previous, learn view and
attribute weights within the learning process. This leads them to quantifying the
influence of attributes and views; however, being part of the learning process, they
are not pre-specifications from the user side on their respective influences.

2.5 Matrix Factorization Approaches to MVC

Much like in the previous section, we start with outlining the basic framework of
nonnegative matrix factorization (NMF), the matrix factorization family that has
been explored widely in clustering. When X, the data matrix (each row being a
data object), is nonnegative, NMF seeks to arrive at a decomposition of it into two
matrices using, in most cases, the following objective function:

\[
\min_{G, F} \; ||X - GF||_F^2 \tag{2.14}
\]

where G is an n × l matrix, and F is an l × |A| matrix, and each of their elements
are constrained to be nonnegative. We observed earlier that when l = k and one-
of-k coding constraints are imposed on G, it comes closer to a K-Means clustering
formulation. In a sense, for general l, one could consider the F matrix as modelling
l object prototypes, with each object in X being constructed as a linear combination
of the prototypes in F using the weights from G. Under this model, F may be called
as the basis matrix and G is the coefficient matrix. The coefficient matrix may be
considered as providing a representation for each object in X within a latent (low-
rank) space. Further, a simple clustering may be achieved by associating each data
object in X with one of l clusters, specifically, the cluster with which it has the
highest coefficient. We will consider three MVC methods that build upon NMF, in
this section.
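As a concrete illustration of this clustering-by-coefficient idea (single view only), the following sketch uses scikit-learn's NMF and assigns each object to the latent factor with its largest coefficient; it is not an implementation of any of the MVC methods discussed next.

```python
import numpy as np
from sklearn.decomposition import NMF

def nmf_cluster(X, n_clusters, seed=0):
    """Factorize a nonnegative data matrix X ~ G F and assign each object
    (row of X) to the latent factor with its largest coefficient in G."""
    model = NMF(n_components=n_clusters, init="nndsvda",
                random_state=seed, max_iter=500)
    G = model.fit_transform(X)       # n x k coefficient matrix
    # F = model.components_          # k x |A| basis matrix
    return G.argmax(axis=1)
```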

2.5.1 Joint NMF for MVC

In what is probably the first method using NMF directly for MVC, Liu et al. [30]
propose factorizing the different data matrices, i.e., Xv s, for MVC. However,
since a single clustering solution across the views is what is desired, the separate
factorizations need to be done jointly in order to achieve similar coefficient matrices
from the different factorizations. Further, the optimization function also involves
learning of a cross-view coefficient matrix that will eventually be used in order to
generate the clustering. The proposed objective function is thus the following:
 
\[
\min_{G_v, F_v, G^{\circ}} \; \sum_{v \in V} ||X_v - G_v F_v||_F^2 + \sum_{v \in V} w_v\, ||G_v - G^{\circ}||_F^2 \tag{2.15}
\]

Note that instead of doing pairwise comparisons of view-specific coefficient
matrices, the above form has a term for a cross-view consensus matrix G◦ from
which the deviations are quantified in the second term. In order to ensure that the
different coefficient matrices are comparable, they additionally impose that the basis
vectors (i.e., within F ) have components summing up to unity. Notationally, the
constraint is the following:

\[
\forall\, 1 \le i \le k, \quad \sum_{j=1}^{l} F_{ij} = 1
\]

Do note that this is in addition to the nonnegativity constraints on all factorized
matrices. The optimization is performed in an iterative framework, where, within
every iteration, the Fv and Gv for all views are optimized for, in addition to
optimizing for G◦ . At the end of the learning process, G◦ is achieved, which may
be used as a new representation for objects in X to be subjected to single-view K-
Means to arrive at a clustering. Alternatively, for each object, the cluster with which
it has the highest coefficient (only if l = k) may be assigned as the cluster to which
the object belongs, resulting in an MVC output.
Unlike the methods seen so far, the set of view weights, wv s, are pre-specified
weights that are not altered/learnt in the course of the optimization framework.

2.5.2 Manifold Regularized NMF for MVC

The joint NMF was closely followed by another extension of NMF for MVC that
is based on manifold regularization [51]. They draw inspiration from previous
work [6] that remedies a “deficiency” in classical NMF, one that relates to preserving
space geometry. Informally, they argue that two data objects in a dataset being
close enough in the intrinsic geometry of the distribution should entail that their
representations (i.e., the coefficient representations from G) be close to each other.
In particular, when a graph representation of data objects is available, objects that
are connected to each other need to be proximal in their NMF representations as
well. Consider a graph representation of objects in X where an edge is induced
between pairs of objects if one of them appears in the other’s k nearest neighbors,
and let L be the n × n Laplacian matrix3 of such a graph. Cai et al. [6] propose
that preservation of intrinsic geometry across objects in X is better achieved if the
following regularizer be added to the usual NMF objective:

\[
\cdots + \text{tr}(G^T L G) \tag{2.16}
\]

3. https://en.wikipedia.org/wiki/Laplacian_matrix

where tr(X) denotes the trace4 of the matrix X. With multiple views entailing
multiple Laplacian matrices (each view-specific matrix represented as Lv ), Zong
et al. [51] model the task using the following objective:

\[
\min_{G_v, F_v, L^{\circ}, G^{\circ}} \; \sum_{v \in V} \Big[ D(X_v \,\|\, G_v F_v) + D(G_v \,\|\, G^{\circ}) + D(L_v \,\|\, L^{\circ}) \Big] + \lambda\, \text{tr}\big((G^{\circ})^T L^{\circ} G^{\circ}\big) \tag{2.17}
\]
where D(X||Y ) denotes the cost function to quantify the difference between X and
Y . While the first and second terms are familiar from having encountered in [30]
above, the third term forces the learning of a cross-view consensus graph structure
and the fourth term incentivizes preservation of local geometry in the consensus
space. The paper proposes different variations of multimanifold regularized NMF
based on the above framework, each of which vary on some aspects of the objective
function construction (some variants also provision for view weights) and the
computation involved in the optimization process.

2.5.3 Deep Matrix Factorization for MVC

Semi-NMF An attractive feature of NMF is the interpretability with each data point
being represented as a linear combination of the different bases (from the basis
matrix) with nonnegative coefficients. Semi-NMF [14] is a modification of NMF
that relaxes the nonnegativity constraints on the data and the basis matrix, while
retaining interpretability by enforcing that the coefficient matrix is nonnegative.
Informally, it makes NMF applicable for mixed-sign data (i.e., some entries of the
X matrix may be negative) and drops the nonnegativity constraint on the entries in
the F matrix.
Extending Semi-NMF for MVC Zhao et al. [50] build upon Semi-NMF to
formulate a sequence of transformations via many basis matrices. For single-view
data, this assumes the form:

\[
\min_{G,\; \forall i\, F_i} \; ||X - G F_r F_{r-1} \cdots F_1||_F^2 \tag{2.18}
\]

where there are r basis matrices to be estimated, their dimensionality controlled
appropriately for the complexity of the model desired. For multi-view data, they
enforce that the representation is common across all views (i.e., G is shared) and add
the manifold regularization term in order to ensure that the common representation
preserves the local geometry within each view. This leads to the following objective
function construction:

4. https://en.wikipedia.org/wiki/Trace_(linear_algebra)

  
\[
\min_{G,\; \forall v\, w_v,\; \forall i,v\, F_i^v} \; \sum_{v \in V} (w_v)^{\alpha} \Big( ||X_v - G F_r^v F_{r-1}^v \cdots F_1^v||_F^2 + \beta\, \text{tr}(G^T L_v G) \Big) \tag{2.19}
\]

with Lv , as earlier, being the graph Laplacian for view v constructed using k nearest
neighbor-induced edges. The usual sum-to-unity constraints are applied on the view
weights, wv . The learnt representation G is then clustered to arrive at an MVC
result. The length of the sequence of transformations, it may be noted, much like
the number of layers in a neural network, may be controlled to learn representations
at the level of data features at different levels of abstraction; r may thus be set
appropriately. It is also noteworthy here that unlike the earlier NMF approaches,
this one can handle mixed-sign data.

2.6 Topic Modelling-Based Approaches

Topic-modelling, a technology from the text processing community, considers
learning a layer in between documents and their component words, called a set of
topics. Each topic is usually represented as a probability distribution over words in
the vocabulary, with each topic having a higher probability associated with words
that typify the topic. Using the observed documents and the words they comprise,
a set of topics, their number controlled by a pre-specified parameter, is learnt. We
provide a brief overview of probabilistic latent semantic analysis [19], a popular
model for learning topics.
A topic is modelled, as outlined earlier, as a probability distribution over the
vocabulary of words, and a document has a probability distribution over the topics.
Let these be represented as P (w|t) and P (t|d), respectively. Then, the likelihood of
occurrence of a word w in document d may be written out as:

\[
P(w, d) = \sum_{t \in T} P(w|t)\, P(t|d) \tag{2.20}
\]

where T is the set of topics. Now, given a corpus of documents, topics may be learnt
by maximizing the likelihood:
\[
\max_{\forall w,t\; P(w|t),\; \forall t,d\; P(t|d)} \;\; \prod_{d \in D} \prod_{w \in W} \left( \sum_{t \in T} P(w|t)\, P(t|d) \right)^{n(d,w)} \tag{2.21}
\]

with n(d, w) indicating the number of times word w has appeared in document d.
The log-likelihood assumes the following form:
   
\[
L = \max_{\forall w,t\; P(w|t),\; \forall t,d\; P(t|d)} \;\; \sum_{d \in D} \sum_{w \in W} n(d, w) \times \log \left( \sum_{t \in T} P(w|t)\, P(t|d) \right) \tag{2.22}
\]

The document–topic probabilities and topic–word probabilities are typically
learnt in alternation in the E and M steps of an Expectation Maximization framework.
Though the above framework is designed for document datasets, it may be trivially
seen that it applies to any dataset if words are considered as attributes and n(d, w)
is interpreted as the weight associated with the feature w in data object d.
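A minimal, dense sketch of this EM alternation is given below; it is a generic PLSA illustration in NumPy (unoptimized, and not the implementation used in any of the works discussed later), with all names ours.

```python
import numpy as np

def plsa(N, n_topics, n_iters=100, seed=0):
    """Minimal PLSA via EM. N is a (documents x words) count/weight matrix.
    Returns P(w|t) as (topics x words) and P(t|d) as (documents x topics)."""
    rng = np.random.default_rng(seed)
    D, W = N.shape
    p_w_t = rng.random((n_topics, W)); p_w_t /= p_w_t.sum(axis=1, keepdims=True)
    p_t_d = rng.random((D, n_topics)); p_t_d /= p_t_d.sum(axis=1, keepdims=True)
    for _ in range(n_iters):
        # E-step: responsibilities P(t|d,w), shape (D, W, topics).
        joint = p_t_d[:, None, :] * p_w_t.T[None, :, :]        # P(w|t) P(t|d)
        resp = joint / np.maximum(joint.sum(axis=2, keepdims=True), 1e-12)
        # M-step: re-estimate P(w|t) and P(t|d) from weighted responsibilities.
        weighted = N[:, :, None] * resp                        # n(d,w) P(t|d,w)
        p_w_t = weighted.sum(axis=0).T                         # (topics x words)
        p_w_t /= np.maximum(p_w_t.sum(axis=1, keepdims=True), 1e-12)
        p_t_d = weighted.sum(axis=1)                           # (docs x topics)
        p_t_d /= np.maximum(p_t_d.sum(axis=1, keepdims=True), 1e-12)
    return p_w_t, p_t_d
```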

2.6.1 Collaborative PLSA for Clustering Two-View Data

Upon estimation of the topics, one could view the topics as clusters. Similar to the
NMF case, a trivial clustering is achieved with each document (i.e., data object)
assigned to the topic with which it has the highest affinity. Alternatively, the P (.|d)
vectors may be treated as unit L1 length representations, which could be subject to
a clustering method such as K-Means.
In [23], an extension of PLSA to two-view data is provided, with applications
to clustering. The idea is that the P (.|d) vectors derived from considering each of
the dataset views separately could be different, but a middleground could be found
by adding a soft constraint in order to bring the vectors together. This results in the
following objective function:

\[
\max_{\forall w,t,v_i\; P_{v_i}(w|t),\; \forall t,d,v_i\; P_{v_i}(t|d)} \;\; w_{v_1} L_{v_1} + w_{v_2} L_{v_2} - \beta \sum_{d} \sum_{t} \big(P_{v_1}(t|d) - P_{v_2}(t|d)\big)^2 \tag{2.23}
\]
Here, Pvi (.|.) denotes the estimates arrived at for view vi . Thus, the objective
function is simply the weighted sum of the log-likelihoods for the two views
separately (denoted by Lvi ) discounted by the square of the L2 distance between
the P (.|d) vectors from either view, summed up over all documents. The view
weights wv are pre-specified and are expected to be part of the user input. The last
term forces the optimization to learn two separate document representations from
separate views, but forces that the representations are close together. The cluster
assignment for each document is arrived at using the max aggregation.

\[
d.C = \arg\max_{t} \big\{ P_{v_1}(t|d),\; P_{v_2}(t|d) \big\} \tag{2.24}
\]

The method, though built upon topic modelling, has also been evaluated over
image datasets; refer to [23] for details.

2.6.2 Voting-Based MVC

In another method [26] that builds upon PLSA topic modelling, a two-phase
approach is proposed. In the first phase, the separate-view datasets are subjected

to separate PLSA topic modelling to arrive at separate view-specific topics; the
number of topics is kept the same across the views. This is followed by a second
phase, where documents that share similar Pv (.|d) vectors are pre-assigned to some
groups, these groups eventually feeding into the final clustering. The documents that
cannot be assigned to the groups are represented using a concatenated representation
derived by simply collating the view-specific representations. All the documents
(grouped as well as ungrouped) are then subjected to another round of topic
modelling, this time a cross-view one. Within the iterations, it is enforced that the
pre-assigned documents do not switch groups, whereas the other documents are
free to switch groups. This leads to a final cross-view P (.|d) vector, which is then
translated to a clustering.

2.7 Spectral Methods

Spectral Clustering (e.g., [35]) operates on a graph formed by the data objects
as nodes and the (usually weighted) edges representing similarities between data
points. As we saw earlier in the section on matrix factorization, a graph can be
induced from a dataset X by forming edges between objects if one of them figures
in the other’s k nearest neighbors; there could be other intuitive ways of translating
a relational dataset into a graph. Spectral clustering algorithms are favored in cases
where the desired clustering output involves irregular shapes. A simple spectral
clustering algorithm over a graph represented as a similarity matrix S may be
outlined as below:
– Graph Laplacian: Form the Laplacian matrix of the graph from the n × n
similarity matrix in the input, which we will denote as L. This would be an n × n
matrix.
– Eigen Decomposition: Perform eigen decomposition over L and choose the top-
k eigenvectors. Let the eigen vectors be stacked column-wise to form an n × k
matrix U .
– Normalization: The rows of the matrix U are normalized to form a normalized
matrix Û . Each row may now be treated as a representation of the corresponding
data point.
– Clustering: The normalized vectors can now be subjected to a conventional
clustering algorithm such as K-Means to arrive at a clustering.
In a fully connected graph, spectral clustering methods may be thought of as
solving a version of the min-cut problem, i.e., identifying a set of k connected
components by discarding a few edges. It is also notable here that the k eigen vectors
of the Laplacian matrix may be thought of as signatures of the k clusters in the
output. We will now consider MVC methods that build upon ideas from spectral
clustering.
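The outline above can be sketched as follows; this is a plain illustration of the generic pipeline (using the eigenvectors associated with the smallest eigenvalues of the unnormalized Laplacian), not any specific spectral MVC method.

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(S, k, seed=0):
    """Simple spectral clustering over an n x n similarity matrix S:
    Laplacian -> eigenvectors -> row normalization -> K-Means."""
    d = S.sum(axis=1)
    L = np.diag(d) - S                      # unnormalized graph Laplacian
    # Eigen decomposition; keep eigenvectors of the k smallest eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(L)
    U = eigvecs[:, :k]                      # n x k spectral embedding
    # Normalize each row so every object lies on the unit sphere.
    norms = np.linalg.norm(U, axis=1, keepdims=True)
    U_hat = U / np.maximum(norms, 1e-12)
    # Conventional clustering on the normalized embedding.
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(U_hat)
```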

2.7.1 Co-training Spectral Clustering Over Two-View Data

Co-training (e.g., [4]) proposes searching in two different hypothesis spaces (e.g.,
two different clusterings in two different data views, in the case of MVC) while
gradually swaying them towards each other, so that the eventual choices of
hypotheses agree reasonably well with each other.
Kumar and Daumé [27] blend co-training with spectral clustering in order to
devise an MVC solution for two-view data. The spectral clustering is solved in the
individual views separately, in order to arrive at two top-k eigen vector column-
stacked matrices U1 and U2 . This is followed by modifying the similarity matrix
S1 using information from U2 (i.e., the clustering exposited by the eigen vectors
in the second view) and modifying the similarity matrix S2 using information
from U1 . The modification is performed by projecting the columns of S1 (S2 ) on
the eigen vectors from U2 (U1 ) and re-projecting them back to the n-dimensional
space. The projection operation using S1 implicitly discards some information that
is not in line with the clustering information from U2 , and vice versa. The modified
S1 (S2 ) is then used in the next iteration for deriving U1 (U2 ). Each projection–
re-projection operation nudges the similarity matrix from one view towards the
clustering structure from the other view. Over the course of many iterations, the
clusterings from either view converge, leading to an MVC solution.
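The projection–re-projection idea can be sketched as below; this is our simplified rendering of one refinement step, not the exact update rule of [27], and the names are illustrative.

```python
import numpy as np

def cotrain_refine(S_this, U_other):
    """One co-training refinement step: project the columns of this view's
    similarity matrix onto the other view's top-k eigenvectors, re-project
    back to the n-dimensional space, and re-symmetrize."""
    P = U_other @ U_other.T        # projector onto the other view's subspace
    S_new = P @ S_this             # discard structure not aligned with U_other
    return (S_new + S_new.T) / 2.0 # keep the similarity matrix symmetric
```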
While extensive empirical results are presented for two-view data, the authors
also outline that the framework may be extended for data with more than two views
by performing the projection–re-projection operation using the aggregated eigen
vectors from all views (but for the one whose similarity matrix is being modified).

2.7.2 Pareto-Optimal Spectral MVC

The co-training approach that we just saw makes an implicit assumption that the two
views are compatible and agree on a single clustering of the data. However, there
may be cases where there are disagreements between data views. A relaxation of the
agreement assumption is used in order to devise another spectral MVC method [42].
Consider two data views v1 and v2 as earlier. Now, given the graphs induced by the
respective view-specific similarity matrices, each cut c of the graph (note that the
graph nodes are common across views, being the data points in parallel data) that
produces k connected components, may be costed for each view; the cost, as may be
intuitive, quantifies how much the view dislikes the cut. Let C(v, c) denote the cost
of the cut c over the view v. Now, a space of 2-d vectors may be defined as follows:

\[
S = \big\{ \big(C(v_1, c),\, C(v_2, c)\big) \;\big|\; c \in \Theta \big\} \tag{2.25}
\]

where Θ is the set of all nontrivial cuts over graphs across the two views. The
optimal cut for the first view, i.e., arg minc∈Θ C(v1 , c), may differ from the optimal

cut for the second view when there are disagreements across the data views.
Thus, Wang et al. [42] propose searching for the set of pareto-optimal cuts, which
is defined as the skyline [5] of S. Towards defining the skyline, it is useful to
understand the domination operator. A vector (x, y) is said to dominate another
(x′, y′) if and only if the following conditions are satisfied:

\[
x \le x' \;\land\; y \le y' \;\land\; (x \ne x' \,\lor\, y \ne y') \tag{2.26}
\]

Thus, (x′, y′) is dominated as long as it is at least as far as (x, y) from the origin
on both axes, as long as the two vectors are not equal. The skyline of S is the set
of vectors that remain when all dominated vectors are filtered out. When vectors in
the cost-space S are considered, it amounts to finding vectors corresponding to a set
of cuts such that it is impossible to find another cut that has lower cost on both the
data views.
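The domination test of Eq. (2.26) and the resulting skyline filter can be sketched as follows; this is a brute-force illustration over an explicit list of cut costs, whereas in practice the candidate cuts are not enumerated this way.

```python
import numpy as np

def dominates(u, v):
    """u dominates v if u is no worse in every coordinate and strictly better
    in at least one (Eq. (2.26), stated for any number of views)."""
    u, v = np.asarray(u), np.asarray(v)
    return bool(np.all(u <= v) and np.any(u < v))

def pareto_front(costs):
    """Indices of the non-dominated (skyline) cost vectors."""
    return [i for i, c in enumerate(costs)
            if not any(dominates(costs[j], c) for j in range(len(costs)) if j != i)]

# Example: each cut costed as (cost in view 1, cost in view 2).
cuts = [(1.0, 5.0), (2.0, 2.0), (4.0, 1.0), (3.0, 3.0)]
print(pareto_front(cuts))   # -> [0, 1, 2]; (3, 3) is dominated by (2, 2)
```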
The set of pareto-optimal cuts could be very large, especially if the disagreements
between the data views are high. Each pareto-optimal cut yields a clustering of the
multi-view dataset. However, if one were interested in aggregating the set of pareto-
optimal cuts to form a single MVC result, the cluster indicator vectors corresponding
to each pareto-optimal cut could be subject to a further round of clustering.
The proposed method easily generalizes to more than two data views, with |V|
view datasets entailing a skyline search over |V|-length vectors. It may, however, be
noted that the size of the pareto-optimal set usually increases quite fast with the
number of dimensions of the vectors involved.

2.8 Exemplar-Based Approaches

Exemplar-based approaches have been popular for clustering (e.g., [17]), with the
clustering process operationalized through message passing. Exemplars are a small
set of data points chosen from the dataset, each of which stand for a cluster in
the output. Each data point is associated with one of the data points in the set of
exemplars, and that association determines the cluster membership as well. The
clustering process is itself initialized by setting each data point as its own exemplar,
and gradually shrinking the set of exemplars through associating each data point
with a different exemplar, typically based on both: (1) how similar the data point is
to the exemplar, and (2) how much other data points also prefer being associated
with the exemplar. The latter factor ensures that the set of exemplars are shrunk
progressively in order to achieve a small set of clusters in the eventual output.
Prior information about suitabilities of particular data points to be exemplars is
easily incorporated by weighting the second criterion highly for such data points.
Exemplar-based clustering is attractive in that there is an automatic choice of
prototypical data object for each cluster, one that could act as a summary in
scenarios where human perusal of clusters is necessary. We describe two approaches
that generalize from exemplar-based clustering for MVC.

2.8.1 Simple Similarity Matrix Aggregation

Scientific journal papers may be treated as multi-view data comprising the text view,
and the citation profile. An exemplar-clustering approach designed for clustering
such scientific datasets, proposed in [34], simply takes the n × n similarity matrix
from the separate views and forms an aggregated similarity matrix using a weighted-
sum form:

\[
S = \alpha S_T + (1 - \alpha) S_C \tag{2.27}
\]

where ST and SC are the similarity matrices from the text and citation views, respec-
tively, and α controls the relative weighting between the two views. The aggregated
similarity matrix S is then subject to exemplar-based affinity propagation to arrive
at an MVC result.

2.8.2 Affinity Propagation with Cross-View Agreement

A more recent work [44] introduces a more sophisticated approach towards
exemplar-based clustering through incorporating an explicit criterion to enforce an
agreement between the clusterings from across views. We describe the objective
function they optimize for, leaving the interested reader to refer to the paper for
finer details of the message passing framework. Within a single view dataset, X, the
objective function for an exemplar approach may be written down as:

\[
J = \sum_{x \in X} S(x, c_x) \;+\; \sum_{x \in X} \begin{cases} -\infty & \text{if } x \ne c_x \,\land\, \big(\exists\, x' \in X : c_{x'} = x\big) \\ 0 & \text{otherwise} \end{cases} \tag{2.28}
\]

where cx denotes the exemplar of the data object x. The second term effectively
forces a data point that is an exemplar for a different data point, to be an exemplar
for itself as well, an intuitively motivated condition. The first term quantifies the
similarity between an object and its exemplar. Estimating the {. . . , cx , . . .} by
maximizing J would intuitively lead to a single-view clustering solution for X.
The MVC work [44] extends this framework for exemplar-based clustering by
adding all the view-specific J s along with a pairwise cross-view agreement term as
follows:

\[
\alpha \sum_{v \in V} J_v \;+\; (1 - \alpha) \sum_{v \in V} \sum_{v' \in V} I(v \ne v') \sum_{x} \frac{\big|N_k^v(x_{c^v}) \cap N_k^{v'}(x_{c^{v'}})\big|}{\big|N_k^v(x_{c^v}) \cup N_k^{v'}(x_{c^{v'}})\big|} \tag{2.29}
\]

where N_k^v(x_{c^v}) denotes the set of k nearest neighbors to the exemplar of the data
point x within view v (note that each data point has a different exemplar within

each view), when neighbors are computed based on the similarity matrix from
the view v. Thus, the last term is a Jaccard similarity term between the pairs of
neighborhoods across two different views, for view-specific exemplars associated
with the same data object. Estimating the view-specific exemplars for each data
object by maximizing this objective leads to a different clustering solution for each
view; however, they are expected to be in reasonable agreement, due to having
considered the cross-view neighborhood agreement in the optimization. The view-
specific clusterings may then be aggregated to arrive at a single MVC result.
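The cross-view neighborhood agreement term can be illustrated as below; this sketch computes, for two views, the Jaccard overlap between the k-nearest-neighbor sets of each object's view-specific exemplars, averaged over objects. The actual objective of [44] sums such terms and couples them with the exemplar estimation; names here are illustrative.

```python
import numpy as np

def knn_sets(S, k):
    """For each object, the set of its k most similar other objects,
    computed from an n x n similarity matrix S."""
    sets = []
    for i in range(S.shape[0]):
        order = np.argsort(-S[i])
        order = order[order != i][:k]        # drop self, keep top-k
        sets.append(set(order.tolist()))
    return sets

def cross_view_agreement(S1, S2, exemplar1, exemplar2, k=5):
    """Average Jaccard overlap between the k-NN neighborhoods of each object's
    view-specific exemplars (cf. the second term of Eq. (2.29))."""
    N1, N2 = knn_sets(S1, k), knn_sets(S2, k)
    scores = []
    for x in range(S1.shape[0]):
        a, b = N1[exemplar1[x]], N2[exemplar2[x]]
        union = a | b
        scores.append(len(a & b) / len(union) if union else 1.0)
    return float(np.mean(scores))
```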

2.9 Other Approaches to MVC

Having considered the major families of techniques for MVC, we now turn our
attention to some methods that cannot be easily categorized into any of the
previously discussed categories. We attempt to provide an illustrative sample of the
variety of approaches to MVC, over and above those covered earlier.

2.9.1 Multi-View Ensemble Clustering

Ensemble Clustering encompasses the family of clustering methods (e.g., [16])
that can fuse multiple clusterings into a single consensus clustering. There has
been recent work on spectral ensemble clustering [31] that suggests that multiple
clusterings of the same set of data X may be leveraged to form a co-association
matrix as follows:

\[
S_{xx'} = \frac{\big|\{\, C \mid C \in \mathcal{C} \,\land\, x.C = x'.C \,\}\big|}{\big|\{\, C \mid C \in \mathcal{C} \,\}\big|} \tag{2.30}
\]

Informally, S_{xx′} captures the fraction of clusterings in the ensemble that put x and
x′ in the same cluster. Now, by minimizing the objective that uses the graph
Laplacian LS of S:

\[
\min_{H} \; \text{tr}(H^T L_S H) \tag{2.31}
\]

with the constraint H^T H = I leads to an estimate of H that can be used to identify
a partitioning of data points in X such that the similarities in S are respected.
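The co-association matrix of Eq. (2.30) is straightforward to compute from a set of base clusterings; a minimal sketch, with illustrative names:

```python
import numpy as np

def co_association(labelings):
    """Co-association matrix of Eq. (2.30): entry (x, x') is the fraction of
    base clusterings that place x and x' in the same cluster.

    labelings : list of (n,) integer label arrays, one per base clustering.
    """
    labelings = [np.asarray(l) for l in labelings]
    n = len(labelings[0])
    S = np.zeros((n, n))
    for labels in labelings:
        S += (labels[:, None] == labels[None, :]).astype(float)
    return S / len(labelings)
```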
The multi-view extension [39] of this ensemble clustering method seeks to learn
a low-rank cross-view representation of the data objects as Z and an associated
H representing a cross-view clustering result. Towards ensuring that the low-rank
representation Z is in agreement with the various view-specific similarity matrices
Sv , a view-specific constraint is imposed, leading to an iterative optimization

formulation. We refer the interested reader to the paper for details of the full
objective function.

2.9.2 Co-clustering for Multi-View Datasets

Co-clustering (e.g., [12]) is a paradigm for clustering words and documents
simultaneously in document datasets. Adopting a variant of co-clustering, Hussain
and Bashir [20] consider learning document–document (i.e., pairwise) similarities
interleaved with learning of word–word similarities within a single-view document
dataset. This leads to an iterative algorithm that, when initialized with simple
similarities based on word and document co-occurrences, produces a refined
estimate of document–document similarities at the end, one that could be used to
form a clustering. Given a clustering, the ratio of intra-cluster document similarities
to document similarities across the corpus leads to an intrinsic (i.e., unsupervised)
goodness measure for the similarity matrix.
Further, Hussain and Bashir [20] propose to extend the single-view method to
multi-view datasets by using cross-view learning within each iteration to arrive at
an MVC method. In one variant, the better document–document similarity estimates
from across views (when evaluated using the intrinsic goodness measure against a
clustering) are chosen to feed into the next learning iteration. They provide two
additional variants that differ on how the information is fused across views within
the iteration.

2.9.3 Multi-View Clustering via Canonical Correlation Analysis

Canonical Correlation Analysis (CCA [40]), which bears relations with principal
component analysis for single-view data, is a classical method for identifying
directions along which the separate views of a two-view dataset are correlated.
In an extension based on CCA, Livescu et al. [32] propose to identify the top-k
CCA directions and then project the separate datasets, i.e., Xv s, to those directions.
This is then followed by a conventional clustering on the transformed view-specific
datasets. Their work is built upon an independence assumption, whereby parallel
samples from the same cluster (i.e., two data samples belonging to the same object,
but from the separate views) can be regarded as independent given the cluster label.
This is intuitive for cases such as multimodal data, e.g., text and video, whereby
the dependence between the text and video from the same person may be largely
attributed to the identity of the person.

2.10 Variations of the MVC Task

Having considered a variety of methods that address the MVC task, we now look at
a few variants of the MVC task that have been explored in the literature.

2.10.1 Multi-View Clustering with Unmapped Data

The conventional setting of MVC, the setting that has occupied all our attention
until now, has been the case where the multi-view data is parallel. Thus, the ith
row of X_v relates to the same object as the ith row of X_v′. However, as observed
in [49], such comprehensively parallel data is not always available. As an example,
in certain cases, the cross-view linking information could be very sparse in that, only
information of presence or absence of linkages between a few pairs of data objects
may be available. Zhang et al. [49] consider the case where different views could
potentially consist of different sets of objects (and even varying number of objects
across views), and there are two sets of linkage information available, as follows:

\[
ML = \{ (i_v, j_{v'}) \mid 1 \le v, v' \le |V| \} \tag{2.32}
\]
\[
CL = \{ (i_v, j_{v'}) \mid 1 \le v, v' \le |V| \} \tag{2.33}
\]

Informally, ML and CL are sets of pairs of objects, which could be from across
views, which are deemed to must-link (i.e., should be part of the same cluster)
or cannot-link (analogously, should be in different clusters), respectively. If prior
information is available that two objects from across views—specifically, the ith
object from view v and the jth object from view v′—are indeed associated with the
same underlying entity, the pair corresponding to them—i.e., (i_v, j_v′)—may then be
added to the set ML. Analogously, if information is available that two objects are
dissimilar, and possibly do not even belong to the same cluster, the corresponding
pair may be added to the CL set.
The work [49] proposes modifying the NMF to account for these constraints as
follows:

  
\[
\sum_{v \in V} ||X_v - G_v F_v||_F^2 \;+ \sum_{(i_v, j_{v'}) \in ML} \sum_{x=1}^{k} \big(G_v[i][x] - G_{v'}[j][x]\big)^2 \;+\; 2 \times \sum_{(i_v, j_{v'}) \in CL} \sum_{x=1}^{k} G_v[i][x] \times G_{v'}[j][x] \tag{2.34}
\]

The first term is the usual NMF loss aggregated across views, whereas the second
and third terms relate to the ML and CL constraints, Gv [i][x] indicating the xth

element in the vector corresponding to the ith element from the coefficient matrix
(which, as we saw earlier, is the clustering indicator matrix for clustering scenarios).
The second term quantifies the dissimilarity between the objects involved in ML
constraints (that needs to be minimized), and the third term quantifies the similarity
between objects in CL constraints (that needs to be minimized as well). Each data
point is then associated with the cluster with which it has the highest coefficient in
the respective Gv matrix, leading to a clustering output.

2.10.2 Multi-Task Multi-View Clustering

Multi-task clustering is the setting where multiple related tasks are to be addressed
together in order to achieve better effectiveness. For example, clustering web
images from Chinese and English websites may be considered as two related
tasks. The images trivially come from the same space, that of images; however,
the text surrounding the images, which may hold valuable cues to arrive at a
clustering solution, come from different spaces, that of Chinese and English words,
respectively. Within each task, the relationship between the component views is
to be ensured in that the clustering they exposit needs to be consistent. On the
other hand, within each view that is shared across tasks, the separate tasks should
use the same notion of similarity. Multi-task Multi-view Clustering [48] considers
accomplishing the two tasks of multi-task and multi-view learning for the clustering
setting, within the same framework. This utilizes the shared data objects across
views, and the shared views across tasks, in order to mutually enhance the accuracies
of the task-specific results. They compartmentalize the overall problem into three
separate considerations as follows:
– Within-view-task Clustering: For a combination of a chosen view and a chosen
task, the clustering problem may be solved by a co-clustering
method that simultaneously partitions both the data objects and the attributes
under consideration. Their co-clustering method arrives at two sets of eigenvectors,
one indicating a partitioning over the attributes, and a second
indicating a partitioning over the data objects; a sketch of this co-clustering
building block is given at the end of this subsection.
– Multi-view Relationship Learning: Consider a specific task; across the
different views associated with the task, we expect to see consistency in
the partitioning of data objects. This may be achieved by incentivizing the
similarity matrices (with each entry indicating pairwise object similarities) to
agree across the separate data partitioning models (i.e., eigenvectors) learnt
for the different views.
– Multi-task Relationship Learning: Analogous to the previous consideration,
this considers learning a shared subspace across related tasks, one that conforms
to the data partitioning structure implied by the eigenvectors learnt
for the view-task combinations.

Each of the above considerations is modelled by an objective function, leading to
an overall objective function for multi-task multi-view clustering. As in the previous
cases, these separate considerations are optimized within an iterative framework, finally
arriving at a multi-task MVC result.
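As a concrete illustration of the within-view-task building block mentioned above, the following is a minimal sketch of bipartite spectral co-clustering in the spirit of Dhillon [12], where singular vectors of a degree-normalized data matrix yield a joint partition of objects and attributes. This is an illustrative stand-in, not the exact formulation of [48]; the function name and the use of k-means for the final discretization are assumptions.

```python
# A minimal sketch of bipartite spectral co-clustering (in the spirit of
# Dhillon [12]): normalize the object-attribute matrix by row/column degrees,
# take singular vectors, and jointly cluster objects and attributes.
import numpy as np
from sklearn.cluster import KMeans

def spectral_coclustering(A, k):
    """Co-cluster a non-negative n x m matrix A into k row/column clusters."""
    eps = 1e-12
    d1 = np.asarray(A.sum(axis=1)).ravel() + eps   # object (row) degrees
    d2 = np.asarray(A.sum(axis=0)).ravel() + eps   # attribute (column) degrees
    An = (A / np.sqrt(d1)[:, None]) / np.sqrt(d2)[None, :]

    # Singular vectors 2..l+1 of the normalized matrix carry the partition
    U, _, Vt = np.linalg.svd(An, full_matrices=False)
    l = int(np.ceil(np.log2(k)))
    Z = np.vstack([U[:, 1:l + 1] / np.sqrt(d1)[:, None],
                   Vt.T[:, 1:l + 1] / np.sqrt(d2)[:, None]])

    labels = KMeans(n_clusters=k, n_init=10).fit_predict(Z)
    n = A.shape[0]
    return labels[:n], labels[n:]   # object clusters, attribute clusters
```

The number of singular vectors retained here follows the ⌈log2 k⌉ heuristic of [12]; other choices, such as k − 1, are also commonly used.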

2.11 Datasets for Multi-View Clustering

We now consider a few datasets for MVC that have been used in putting forward the
empirical case for the various MVC methods that have appeared in the literature.
Table 2.2 lists 28 datasets that have been used to evaluate MVC methods.

Table 2.2 Listing of a few datasets used for MVC


Dataset |V| Remarks References
Reuters ML 6 Same text document in multiple natural languages [30]
UCI Digits 2 Handwritten digits: views are Fourier coefficients and profile correlations [27]
IMDB 2 Movies represented across actors and keywords [20]
CiteSeer 2 Scholarly articles with text and citation views [20]
Cora 2 Scholarly articles with text and citation views [20]
Cornell 2 Webpages with document and link views [20]
Corel5k 2 Images with RGB histogram and SIFT views [23]
Caltech101-7 6 Images with different types of image features [46]
MSRC 6 Images with different types of image features [46]
Handwritten 6 Images with different types of image features [46]
WebKB 3 Web pages with text, inward anchor text, and title views [48]
NUS-WIDE 7 Images with tags and six views of low-level features [41]
VidTIMIT 2 Audio and video of a person speaking a sentence [32]
Wikipedia 2 Text and inward/outward links [32]
3-Sources 3 News Stories from BBC, Reuters, and The Guardian [30]
ALOI 4 Four sets of image features [24]
Pascal VOC 2 Images with color and bow features as views [43]
SensIT 2 Transportation data with acoustic and seismic views [7]
CQADupStack 2 Text data with question and answer views [9]
Animal 6 Six sets of image features [7]
SUN 397 7 Seven sets of image features [7]
Water Treatment 4 Four sets of features of water treatment plants [8]
Yeast Cell Cycle 5 Five sets of features from microarray data [8]
Internet Ads 6 Image and text features associated with internet images [8]
Yale 3 Three sets of image features [50]
Notting-Hill 3 Three sets of image features from video face images [50]
Oxford Flowers 4 Four sets of image features [45]
SemEval2016-Task3 2 Text data with questions and comments as views [10]

Many of the datasets have been used across multiple papers; however, for brevity,
we just list one reference against each dataset. As may be seen from the table, the
datasets comprise a wide variety of data types: text, links (e.g., weblinks, anchor
texts, and citations), and images. It is particularly worth noting that many
datasets have only two views, whereas a few datasets have as many as six or seven views.
However, even the six- and seven-view datasets have been arrived at by partitioning
features into sets of related features, rather than those views being intrinsic to the
data representation itself. One may infer from the list that multi-view learning would
benefit from the availability of more diverse datasets, such as those that come from
varying domains and those that organically have many views.

2.12 Conclusions

In this chapter, we have considered various families of approaches that have been
explored for addressing the multi-view clustering problem. Starting with a formal
definition of the problem, we considered the different formulations that have been
employed in multi-view clustering, treating them in clusters of approaches based
on similarities in their methodology. We then looked at some variations of the MVC
setting that have been considered in the literature. We wish to re-emphasize here
that our focus has not been to comprehensively cover techniques developed for
multi-view clustering, for there are far too many of them; instead, we have chosen
a few illustrative approaches from each school of methods in order to provide a
diversified bird's-eye view of methods for the task. Finally, we also listed a set
of popular datasets that have been used for benchmarking and evaluating MVC
algorithms. Our aim has been to provide information in an accessible form,
so that readers who may not be familiar with the mathematical details of specific
machine learning building blocks would also be able to comprehend and utilize this
chapter, for scenarios such as choosing a particular MVC method for addressing a
task at hand.

References

1. Balachandran, V., Deepak, P., Khemani, D.: Interpretable and reconfigurable clustering of
document datasets by deriving word-based rules. Knowl. Inf. Syst. 32(3), 475–503 (2012)
2. Bickel, S., Scheffer, T.: Multi-view clustering. In: ICDM, vol. 4, pp. 19–26 (2004)
3. Blei, D.M.: Probabilistic topic models. Commun. ACM 55(4), 77–84 (2012)
4. Blum, A., Mitchell, T.: Combining labeled and unlabeled data with co-training. In: Proceedings
of the Eleventh Annual Conference on Computational Learning Theory, pp. 92–100. ACM,
New York (1998)
5. Borzsony, S., Kossmann, D., Stocker, K.: The skyline operator. In: 2001 Proceedings of the
17th International Conference on Data Engineering, pp. 421–430. IEEE, Piscataway (2001)

6. Cai, D., He, X., Han, J., Huang, T.S.: Graph regularized nonnegative matrix factorization for
data representation. IEEE Trans. Pattern Anal. Mach. Intell. 33(8), 1548–1560 (2011)
7. Cai, X., Nie, F., Huang, H.: Multi-view k-means clustering on big data. In: IJCAI, pp. 2598–
2604 (2013)
8. Chen, X., Xu, X., Huang, J.Z., Ye, Y.: Tw-k-means: automated two-level variable weighting
clustering algorithm for multiview data. IEEE Trans. Knowl. Data Eng. 25(4), 932–944 (2013)
9. Deepak, P.: Mixkmeans: clustering question-answer archives. In: Proceedings of the 2016
Conference on Empirical Methods in Natural Language Processing, pp. 1576–1585 (2016)
10. Deepak, P., Garg, D., Shevade, S.: Latent space embedding for retrieval in question-answer
archives. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language
Processing, pp. 855–865 (2017)
11. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the
EM algorithm. J. R. Stat. Soc. Ser. B Methodol. 39(1), 1–38 (1977)
12. Dhillon, I.S.: Co-clustering documents and words using bipartite spectral graph partitioning.
In: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, pp. 269–274. ACM, New York (2001)
13. Ding, C., He, X., Simon, H.D.: Nonnegative Lagrangian relaxation of K-means and spectral
clustering. In: European Conference on Machine Learning. pp. 530–538. Springer, Berlin
(2005)
14. Ding, C.H., Li, T., Jordan, M.I.: Convex and semi-nonnegative matrix factorizations. IEEE
Trans. Pattern Anal. Mach. Intell. 32(1), 45–55 (2010)
15. Eisen, M.B., Spellman, P.T., Brown, P.O., Botstein, D.: Cluster analysis and display of genome-
wide expression patterns. Proc. Natl. Acad. Sci. 95(25), 14863–14868 (1998)
16. Fred, A.L., Jain, A.K.: Combining multiple clusterings using evidence accumulation. IEEE
Trans. Pattern Anal. Mach. Intell. 27(6), 835–850 (2005)
17. Frey, B.J., Dueck, D.: Clustering by passing messages between data points. Science 315(5814),
972–976 (2007)
18. Hodge, V., Austin, J.: A survey of outlier detection methodologies. Artif. Intell. Rev. 22(2),
85–126 (2004)
19. Hofmann, T.: Probabilistic latent semantic analysis. In: Proceedings of the Fifteenth Confer-
ence on Uncertainty in Artificial Intelligence, pp. 289–296. Morgan Kaufmann Publishers Inc.,
San Francisco (1999)
20. Hussain, S.F., Bashir, S.: Co-clustering of multi-view datasets. Knowl. Inf. Syst. 47(3), 545–
570 (2016)
21. Jain, A.K.: Data clustering: 50 years beyond k-means. Pattern Recogn. Lett. 31(8), 651–666
(2010)
22. Jain, A.K., Dubes, R.C.: Algorithms for Clustering Data. Prentice Hall, Englewood Cliffs
(1988)
23. Jiang, Y., Liu, J., Li, Z., Lu, H.: Collaborative PLSA for multi-view clustering. In: 2012 21st
International Conference on Pattern Recognition (ICPR), pp. 2997–3000. IEEE, Piscataway
(2012)
24. Jiang, B., Qiu, F., Wang, L.: Multi-view clustering via simultaneous weighting on views and
features. Appl. Soft Comput. 47, 304–315 (2016)
25. Jing, L., Ng, M.K., Huang, J.Z.: An entropy weighting K-means algorithm for subspace
clustering of high-dimensional sparse data. IEEE Trans. Knowl. Data Eng. 19(8), 1026–1041
(2007)
26. Kim, Y.M., Amini, M.R., Goutte, C., Gallinari, P.: Multi-view clustering of multilingual
documents. In: Proceedings of the 33rd International ACM SIGIR Conference on Research
and Development in Information Retrieval, pp. 821–822. ACM, New York (2010)
27. Kumar, A., Daumé, H.: A co-training approach for multi-view spectral clustering. In: Pro-
ceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 393–400
(2011)
28. Lee, D.D., Seung, H.S.: Learning the parts of objects by non-negative matrix factorization.
Nature 401(6755), 788 (1999)

29. Liao, T.W.: Clustering of time series data—a survey. Pattern Recogn. 38(11), 1857–1874
(2005)
30. Liu, J., Wang, C., Gao, J., Han, J.: Multi-view clustering via joint nonnegative matrix
factorization. In: Proceedings of the 2013 SIAM International Conference on Data Mining,
pp. 252–260. SIAM, Philadelphia (2013)
31. Liu, H., Liu, T., Wu, J., Tao, D., Fu, Y.: Spectral ensemble clustering. In: Proceedings of the
21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,
pp. 715–724. ACM, New York (2015)
32. Livescu, K., Sridharan, K., Kakade, S., Chaudhuri, K.: Multi-view clustering via canonical
correlation analysis. In: NIPS Workshop: Learning from Multiple Sources (2008)
33. MacQueen, J., et al.: Some methods for classification and analysis of multivariate observations.
In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability,
Oakland, vol. 1, pp. 281–297 (1967)
34. Meng, X., Liu, X., Tong, Y., Glänzel, W., Tan, S.: Multi-view clustering with exemplars for
scientific mapping. Scientometrics 105(3), 1527–1552 (2015)
35. Ng, A.Y., Jordan, M.I., Weiss, Y.: On spectral clustering: analysis and an algorithm. In:
Advances in Neural Information Processing Systems, pp. 849–856 (2002)
36. Nie, F., Huang, H., Cai, X., Ding, C.H.: Efficient and robust feature selection via joint ℓ2,1-
norms minimization. In: Advances in Neural Information Processing Systems, pp. 1813–1821
(2010)
37. Roweis, S.T., Saul, L.K.: Nonlinear dimensionality reduction by locally linear embedding.
Science 290(5500), 2323–2326 (2000)
38. Shi, J., Malik, J.: Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach.
Intell. 22(8), 888–905 (2000)
39. Tao, Z., Liu, H., Li, S., Ding, Z., Fu, Y.: From ensemble clustering to multi-view clustering.
In: Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence
(IJCAI), pp. 2843–2849 (2017)
40. Thompson, B.: Canonical correlation analysis. In: Encyclopedia of Statistics in Behavioral
Science. Wiley, West Sussex (2005)
41. Wang, H., Nie, F., Huang, H.: Multi-view clustering and feature learning via structured sparsity.
In: International Conference on Machine Learning, pp. 352–360 (2013)
42. Wang, X., Qian, B., Ye, J., Davidson, I.: Multi-objective multi-view spectral clustering via
pareto optimization. In: Proceedings of the 2013 SIAM International Conference on Data
Mining, pp. 234–242. SIAM, Philadelphia (2013)
43. Wang, D., Yin, Q., He, R., Wang, L., Tan, T.: Multi-view clustering via structured low-rank
representation. In: Proceedings of the 24th ACM International on Conference on Information
and Knowledge Management, pp. 1911–1914. ACM, New York (2015)
44. Wang, C.D., Lai, J.H., Philip, S.Y.: Multi-view clustering based on belief propagation. IEEE
Trans. Knowl. Data Eng. 28(4), 1007–1021 (2016)
45. Wang, Y., Chen, L., Li, X.L.: Multiple medoids based multi-view relational fuzzy clustering
with minimax optimization. In: Proceedings of the Twenty-Sixth International Joint Confer-
ence on Artificial Intelligence, pp. 2971–2977 (2017)
46. Xu, J., Han, J., Nie, F.: Discriminatively embedded K-means for multi-view clustering. In:
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5356–
5364 (2016)
47. Xu, Y.M., Wang, C.D., Lai, J.H.: Weighted multi-view clustering with feature selection. Pattern
Recogn. 53, 25–35 (2016)
48. Zhang, X., Zhang, X., Liu, H.: Multi-task multi-view clustering for non-negative data. In:
IJCAI, pp. 4055–4061 (2015)
49. Zhang, X., Zong, L., Liu, X., Yu, H.: Constrained NMF-based multi-view clustering on
unmapped data. In: AAAI, pp. 3174–3180 (2015)
50. Zhao, H., Ding, Z., Fu, Y.: Multi-view clustering via deep matrix factorization. In: AAAI,
pp. 2921–2927 (2017)
51. Zong, L., Zhang, X., Zhao, L., Yu, H., Zhao, Q.: Multi-view clustering via multi-manifold
regularized non-negative matrix factorization. Neural Netw. 88, 74–89 (2017)
