VDSH
ABSTRACT
As the amount of textual data has been rapidly increasing over the past decade, efficient similarity search methods have become a crucial component of large-scale information retrieval systems. A popular strategy is to represent the original data samples by compact binary codes through hashing. A spectrum of machine learning methods has been utilized for this purpose, but these methods often lack the expressiveness and flexibility in modeling needed to learn effective representations. The recent advances of deep learning in a wide range of applications have demonstrated its capability to learn robust and powerful feature representations for complex data. In particular, deep generative models naturally combine the expressiveness of probabilistic generative models with the high capacity of deep neural networks, which is very suitable for text modeling. However, little work has leveraged the recent progress in deep learning for text hashing. In this paper, we propose a series of novel deep document generative models for text hashing. The first proposed model is unsupervised, while the second one is supervised by utilizing document labels/tags for hashing. The third model further considers document-specific factors that affect the generation of words. The probabilistic generative formulation of the proposed models provides a principled framework for model extension, uncertainty estimation, simulation, and interpretability. Based on variational inference and reparameterization, the proposed models can be interpreted as encoder-decoder deep neural networks and are thus capable of learning complex nonlinear distributed representations of the original documents. We conduct a comprehensive set of experiments on four public testbeds. The experimental results demonstrate the effectiveness of the proposed supervised learning models for text hashing.

CCS CONCEPTS
• Information systems → Information retrieval; • Computing methodologies → Neural networks; Learning latent representations;

KEYWORDS
Semantic hashing; Variational autoencoder; Deep learning

1 INTRODUCTION
The task of similarity search, also known as nearest neighbor search, proximity search, or close item search, is to find similar items given a query object [35]. It has many important information retrieval applications such as document clustering, content-based retrieval, and collaborative filtering [33]. The rapid growth of the Internet has resulted in massive textual data in recent decades. In addition to the cost of storage, searching for relevant content in gigantic databases is even more daunting. Traditional text similarity computations are conducted in the original vector space and can be prohibitive for large-scale corpora, since these methods incur a high cost of numerical computation in high-dimensional spaces.

Many research efforts have been devoted to approximate similarity search, which has been shown to be useful for practical problems. Hashing [5, 28, 38] is an effective solution that accelerates similarity search by designing compact binary codes in a low-dimensional space so that semantically similar documents are mapped to similar codes. This approach is much more efficient in memory and computation. A binary representation of each document often needs only 4 or 8 bytes to store, so a large number of encoded documents can be loaded directly into main memory. Computing the similarity between two documents can be accomplished with a bitwise XOR operation, which takes only one CPU instruction. A spectrum of machine learning methods has been utilized in hashing, but they often lack expressiveness and flexibility in modeling, which prevents them from learning compact and effective representations of text documents.
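To make the storage and lookup argument concrete, the following minimal Python sketch (our own illustration, not code from the paper) stores 32-bit codes as packed bytes and ranks documents by the number of differing bits computed through XOR; all names and sizes are hypothetical.

```python
import numpy as np

def hamming_distances(codes, query_code):
    """Hamming distances between one query code and a matrix of packed codes.

    codes:      (n_docs, n_bytes) uint8 array, each row a packed binary hash code
    query_code: (n_bytes,) uint8 array for the query document
    """
    # XOR marks the bits that differ; unpackbits + sum counts them per document.
    diff = np.bitwise_xor(codes, query_code)
    return np.unpackbits(diff, axis=1).sum(axis=1)

# 32-bit codes take 4 bytes per document, so one million documents fit in ~4 MB.
rng = np.random.default_rng(0)
corpus_codes = rng.integers(0, 256, size=(1_000_000, 4), dtype=np.uint8)
query_code = rng.integers(0, 256, size=4, dtype=np.uint8)
top100 = np.argsort(hamming_distances(corpus_codes, query_code))[:100]
```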
On the other hand, deep learning has made tremendous progress in the past decade and has demonstrated impressive successes in a variety of domains including speech recognition, computer vision, and natural language processing [18]. One of the main purposes of deep learning is to learn robust and powerful feature representations for complex data. Recently, deep generative models with variational inference [14, 27] have further boosted the expressiveness and flexibility of representation learning by integrating deep neural nets into the probabilistic generative framework. The seamless combination of generative modeling and deep learning makes them suitable for text hashing. However, to the best of our knowledge, no prior work has leveraged them for hashing tasks.

In this paper, we propose a series of novel deep document generative models for text hashing, inspired by the variational autoencoder (VAE) [14, 27]. The proposed models are the marriage of deep learning and probabilistic generative models [1], and they enjoy the good properties of both learning paradigms. First, with deep neural networks, the proposed models can learn flexible nonlinear distributed representations of the original high-dimensional documents. This allows individual codes to be fairly general and
concise but their intersection to be much more precise. For example, nonlinear distributed representations allow the topics/codes "government," "mafia," and "playboy" to combine to give very high probability to the word "Berlusconi," which is not predicted nearly as strongly by each topic/code alone.

Meanwhile, the proposed models are probabilistic generative models, and thus there exists an underlying data generation process characterizing each model. The probabilistic generative formulation provides a principled framework for model extensions such as incorporating supervisory signals and adding private variables. The first proposed model is unsupervised and can be interpreted as a variant of the variational autoencoder for text documents. The other two models are supervised by utilizing the document label/tag information. Prior work in the literature [36] has demonstrated that supervisory signals are crucial to boost the performance of semantic hashing for text documents. The third model further adds a private latent variable for documents to capture the information only concerned with the documents but irrelevant to the labels, which may help remove noise from document representations. Furthermore, specific constraints can be enforced by making explicit assumptions in the models. One desirable property of a hash code is that its bits are uncorrelated, so that the next bit cannot be predicted based on the previous bits [38]. To achieve this property, we can simply assume that the latent variable has a prior distribution with independent dimensions.

In sum, the probabilistic generative formulation provides a principled framework for model extensions, interpretability, uncertainty estimation, and simulation, which are often lacking in deep learning models but useful in text hashing. The main contributions of the paper can be summarized as follows:

• We proposed a series of unsupervised and supervised deep document generative models to learn compact representations for text documents. To the best of our knowledge, this is the first work that utilizes deep generative models with variational inference for text hashing.
• The proposed models enjoy the advantages of both deep learning and probabilistic generative models. They can learn complex nonlinear distributed representations of the original high-dimensional documents while providing a principled framework for probabilistic reasoning.
• We derived tractable variational lowerbounds for the proposed models and reparameterized the models so that backpropagation can be applied for efficient parameter estimation.
• We conducted a comprehensive set of experiments on four public testbeds. The experimental results demonstrate significant improvements of our supervised models over several well-known semantic hashing baselines.

2 RELATED WORK

2.1 Hashing
Due to the computational and storage efficiency of compact binary codes, hashing methods have been widely used for similarity search, which is an essential component in a variety of large-scale information retrieval systems [33, 35]. Locality-Sensitive Hashing (LSH) [2] is one of the most popular hashing methods, with interesting asymptotic theoretical properties leading to performance guarantees. While LSH is a data-independent hashing method, many hashing methods have recently been proposed to leverage machine learning techniques with the goal of learning data-dependent hash functions, ranging from unsupervised and supervised to semi-supervised settings. Unsupervised hashing methods attempt to integrate data properties, such as distributions and manifold structures, to design compact hash codes with improved accuracy. For instance, Spectral Hashing (SpH) [38] explores the data distribution by preserving the similarity between documents while forcing balanced and uncorrelated constraints on the learned codes, which can be viewed as an extension of spectral clustering [25]. Graph hashing [21] utilizes the underlying manifold structure of the data captured by a graph representation. Self-Taught Hashing (STH) [41] is a state-of-the-art hashing method that decomposes the learning procedure into two steps: generating binary codes and learning the hash function.

Supervised hashing methods attempt to leverage label/tag information for hash function learning and have attracted more and more attention in recent years. For example, Wang et al. [36] propose Semantic Hashing using Tags and Topic Modeling (SHTTM) to incorporate tags and obtain more effective hashing codes via a matrix factorization formulation. To utilize pairwise supervision information in hash function learning, Kernel-Based Supervised Hashing (KSH) [20] uses pairwise relationships between samples to achieve high-quality hashing. Binary Reconstructive Embedding (BRE) [15] was proposed to learn hash functions by minimizing the reconstruction error between the metric space and the Hamming space. Moreover, there are also several works using ranking order information to design hash functions. Ranking-based Supervised Hashing (RSH) [34] was proposed to leverage listwise supervision in the hash function learning framework. The semi-supervised learning paradigm has also been employed to design hash functions using both labeled and unlabeled data [32]. The hashing-code learning problem is essentially a discrete optimization problem, which is difficult to solve. Most existing supervised hashing methods solve a relaxed continuous optimization problem and then threshold the continuous representation to obtain a binary code. Abundant related work, especially on image hashing, exists in the literature. Two recent surveys [33, 35] provide a comprehensive literature review.

2.2 Deep Learning
Deep learning has drawn increasing attention and research effort in a variety of artificial intelligence areas including speech recognition, computer vision, and natural language processing. Since one main purpose of deep learning is to learn robust and powerful feature representations for complex data, it is natural to leverage deep learning to explore compact hash codes, which can be regarded as binary representations of data. Most of the related work has focused on image data [4, 16, 19, 39] rather than text documents, probably due to the effectiveness of convolutional neural networks (CNNs) at learning good low-dimensional representations of images. The typical deep learning architectures for hash function learning consist of CNN layers for representation learning and hash function layers which then transform the representation
to supervisory signals. The loss functions can be pointwise [19], pairwise [4], or listwise [16].

Some recent works have applied deep learning to several IR tasks such as ad-hoc retrieval [9], web search [11], and ranking pairs of short texts [29]. However, very few have investigated deep learning for text hashing. The representative work is semantic hashing [28]. It builds a stack of restricted Boltzmann machines (RBMs) [10] to discover hidden binary units which can model input text data (i.e., word-count vectors). After learning a multilayer RBM through pretraining and fine-tuning on a collection of documents, the hash code of any document is acquired by simply thresholding the output of the deepest layer. A recent work [40] exploited convolutional neural networks for text hashing, which relies on external features such as the GloVe word embeddings to construct text representations.

Recently, deep generative models have made impressive progress with the introduction of variational autoencoders (VAEs) [14, 27] and Generative Adversarial Networks (GANs) [8]. VAEs are an especially appealing framework for generative modeling because they couple the approach of variational inference [31] with deep learning. As a result, they enjoy the advantages of both deep learning and probabilistic graphical models. Deep generative models parameterized by neural networks have achieved state-of-the-art performance in unsupervised and supervised learning [13, 14, 24]. To the best of our knowledge, our proposed models are the first work that utilizes variational inference with deep learning for text hashing. It is worth pointing out that both semantic hashing with stacked RBMs [28] and our models are deep generative models, but the former are undirected graphical models while the latter are directed models. The underlying generative process of directed probabilistic models makes them easy to interpret and extend. The proposed models are also very scalable since they are trained as deep neural networks by efficient backpropagation, while stacked RBMs are often much harder to train [10].

3 VARIATIONAL DEEP SEMANTIC HASHING
This section presents three novel deep document generative models to learn low-dimensional semantic representations of documents for text hashing. In Section 3.1, we introduce the basic model, which is essentially a variational autoencoder for text modeling. Section 3.2 extends the model to utilize label information to learn a more sensible representation. Section 3.3 further incorporates document private variables to model document-specific information. Based on variational inference, all three models can be viewed as having an encoder-decoder neural network architecture, where the encoder compresses a high-dimensional document into a compact latent semantic vector and the decoder reconstructs the document (or the labels). Section 3.4 discusses two thresholding methods to convert the continuous latent vector into a binary code for text hashing.

3.1 Unsupervised Learning (VDSH)
In this section, we present the basic variational deep semantic hashing (VDSH) model for the unsupervised learning setting. VDSH is a probabilistic generative model of text which aims to extract a continuous low-dimensional semantic representation s ∈ R^K for each document. Let d ∈ R^V be the bag-of-words representation of a document and w_i ∈ {0, 1}^V be the one-hot vector representation of the i-th word of the document, where V is the vocabulary size. d can be represented by different term weighting schemes such as binary, TF, and TFIDF [23]. The document generative process can be described as follows:

• For each document d,
  – Draw a latent semantic vector s ∼ P(s), where P(s) = N(0, I) is the standard Gaussian distribution.
  – For the i-th word in the document,
    ∗ Draw w_i ∼ P(w_i | f(s; θ)).

The conditional probability over words w_i is modeled by multinomial logistic regression and shared across documents as below:

P(w_i | f(s; θ)) = exp(w_i^T f(s; θ)) / Σ_{j=1}^{V} exp(w_j^T f(s; θ))    (1)

While P(s) is a simple Gaussian distribution, any distribution can be generated by mapping the simple Gaussian through a sufficiently complicated function [3]. Thus, f(s; θ) is such a highly flexible function approximator, usually a neural network. In other words, we can learn a function which maps our independent, normally distributed s values to whatever latent semantic variables might be needed for the model, and then generate the word w_i. However, introducing a highly nonlinear mapping from s to w_i results in an intractable data likelihood ∫_s P(d|s)P(s) ds and thus an intractable posterior distribution P(s|d) [14]. Similar to the VAE, we use an approximation Q(s|d; ϕ) for the true posterior distribution. By applying the variational inference principle [31], we can obtain the following tractable lowerbound of the document log likelihood (see [14] and the Appendix):

L_1 = E_Q[ Σ_{i=1}^{N} log P(w_i | f(s; θ)) ] − D_KL(Q(s|d; ϕ) ‖ P(s))    (2)

where N is the number of words in the document and D_KL(· ‖ ·) is the Kullback-Leibler (KL) divergence between the approximate posterior distribution Q(s|d; ϕ) and the prior P(s). The variational distribution Q(s|d; ϕ) acts as a proxy to the true posterior P(s|d). To enable high capacity, it is assumed to be a Gaussian N(µ, diag(σ²)) whose mean µ and variance σ² are the output of a highly nonlinear function of d denoted as g(d; ϕ), parameterized by ϕ, once again typically a neural network.

In training, the variational lowerbound in Eqn.(2) is maximized with respect to the model parameters. Since P(s) is a standard Gaussian prior, the KL divergence D_KL(Q(s|d; ϕ) ‖ P(s)) in Eqn.(2) can be computed analytically. The first term E_Q[·] can be viewed as an expected negative reconstruction error of the words in the document, and it can be computed based on Monte Carlo estimation [7].
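Since Q(s|d; ϕ) is a diagonal Gaussian and P(s) is the standard normal, the KL term above has the usual closed form used for VAEs (a standard identity, see [14]; it is restated here rather than copied from the paper's appendix), with K the dimensionality of s:

$$
D_{KL}\big(\mathcal{N}(\mu,\operatorname{diag}(\sigma^2))\,\big\|\,\mathcal{N}(0,I)\big)
= \frac{1}{2}\sum_{k=1}^{K}\left(\mu_k^{2}+\sigma_k^{2}-\log\sigma_k^{2}-1\right)
$$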
Based on Eqn.(2), we can interpret VDSH as a variational autoencoder with discrete output: a feedforward neural network encoder Q(s|d; ϕ) compresses document representations into continuous hidden vectors, i.e., d → s, and a softmax decoder Π_{i=1}^{N} P(w_i | f(s; θ)) reconstructs the documents by independently generating the words, s → {w_i}_{i=1}^{N}.
Figure 1: Architectures of (a) VDSH, (b) VDSH-S, and (c) VDSH-SP. The dashed line represents a stochastic layer.
Figure 1(a) illustrates the architecture of VDSH. In the experiments, we use the following specific architecture for the encoder and decoder.

Encoder Q(s | g(d; ϕ)):
  t_1 = ReLU(W_1 d + b_1)
  t_2 = ReLU(W_2 t_1 + b_2)
  µ = W_3 t_2 + b_3
  log σ = W_4 t_2 + b_4
  s ∼ N(µ(d), diag(σ²(d)))

Decoder P(w_i | f(s; θ)):
  c_i = exp(−s^T G w_i + b_{w_i})
  P(w_i | s) = c_i / Σ_{k=1}^{V} c_k
  P(d | s) = Π_{i=1}^{N} P(w_i | s)

This architecture is similar to the one presented for the VAE [27], except that VDSH has a softmax layer to model discrete words while the VAE was proposed to model images as continuous output. Here, the encoder has two Rectified Linear Unit (ReLU) [7] layers. ReLU generally does not suffer from the vanishing gradient problem that affects other activation functions. Also, it has been shown that deep neural networks can be trained efficiently using ReLU even without pretraining [7].

In this architecture, there is a stochastic layer which samples s from the Gaussian distribution N(µ(d), diag(σ²(d))), as represented by the dashed lines in the middle of the networks in Figure 1. Backpropagation cannot handle a stochastic layer within the network. In practice, we can leverage the "location-scale" property of the Gaussian distribution and use the reparameterization trick [14] to turn the stochastic layer of s into a deterministic one. As a result, the encoder Q(s|d; ϕ) and decoder P(w_i | f(s; θ)) form an end-to-end neural network and are trained jointly by maximizing the variational lowerbound in Eqn.(2) with respect to their parameters using the standard backpropagation algorithm [7].
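To show how the pieces fit together, here is a minimal PyTorch sketch of the VDSH encoder-decoder with the reparameterization s = µ + σ ⊙ ε made explicit. It is an illustration under our own naming and simplifications (for instance, the decoder scores the whole vocabulary from the weighted bag-of-words d in one pass), not the authors' TensorFlow implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VDSH(nn.Module):
    """Minimal sketch of the unsupervised VDSH encoder-decoder."""
    def __init__(self, vocab_size, hidden_dim=1000, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(            # two ReLU layers: t1, t2
            nn.Linear(vocab_size, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        self.to_mu = nn.Linear(hidden_dim, latent_dim)       # mean head (W3 t2 + b3)
        self.to_logvar = nn.Linear(hidden_dim, latent_dim)   # log-variance head
        self.decoder = nn.Linear(latent_dim, vocab_size)     # plays the role of G and b_w

    def forward(self, d):
        t2 = self.encoder(d)
        mu, logvar = self.to_mu(t2), self.to_logvar(t2)
        # Reparameterization trick: s = mu + sigma * eps, with eps ~ N(0, I).
        s = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        log_probs = F.log_softmax(self.decoder(s), dim=-1)   # Eqn.(1) over the vocabulary
        return log_probs, mu, logvar

def elbo_loss(log_probs, d, mu, logvar):
    """Negative of the lowerbound in Eqn.(2), averaged over a batch; d holds
    (weighted) word counts, so the sum over word positions becomes a
    count-weighted sum over the vocabulary."""
    recon = -(d * log_probs).sum(dim=-1)                            # reconstruction term
    kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(dim=-1)  # analytic KL term
    return (recon + kl).mean()
```

Minimizing elbo_loss with any stochastic gradient method is then equivalent to maximizing Eqn.(2) with a single Monte Carlo sample per document.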
3.2 Supervised Learning (VDSH-S)
In many real-world applications, documents are often associated with labels or tags which may provide useful guidance in learning effective hashing codes. Document content similarity in the original bag-of-words space may not fully reflect the semantic relationship between documents. For example, two documents in the same category may have low document content similarity due to the vocabulary gap, while their semantic similarity could be high. In this section, we extend VDSH to the supervised setting with the new model, denoted VDSH-S. The probabilistic generative process of a document with labels is as follows:

• For each document d,
  – Draw a latent semantic vector s ∼ P(s), where P(s) = N(0, I) is the standard Gaussian distribution.
  – For the i-th word in the document,
    ∗ Draw w_i ∼ P(w_i | f(s; θ)).
  – For the j-th label in the label set,
    ∗ Draw y_j ∼ P(y | f(s; τ)).

where y_j ∈ {0, 1}^L is the one-hot representation of label j in the label set and L is the total number of possible labels (the size of the label set). Let Y ∈ {0, 1}^L represent the bag-of-labels of the document (i.e., if the document has label j, the j-th dimension of Y is 1; otherwise, it is 0). VDSH-S assumes that both words and labels are generated based on the same latent semantic vector.

We assume a general multi-label classification setting where each document can have multiple labels/tags. P(y_j | f(s; τ)) can be modeled by the logistic function as follows:

P(y_j | f(s; τ)) = 1 / (1 + exp(−y_j^T f(s; τ)))    (3)

Similar to VDSH, f(s; τ) is parameterized by a neural network with parameter τ so that we can learn an effective mapping from the latent semantic vector to the labels. The lowerbound of the data log likelihood can be similarly derived and is as follows:

L_2 = E_Q[ Σ_{i=1}^{N} log P(w_i | f(s; θ)) + Σ_{j=1}^{L} log P(y_j | f(s; τ)) ] − D_KL(Q(s|d, Y; ϕ) ‖ P(s))    (4)

Compared to Eqn.(2) in VDSH, this lowerbound has an extra term, E_Q[ Σ_{j=1}^{L} log P(y_j | f(s; τ)) ], which can be computed in a similar way to E_Q[ Σ_{i=1}^{N} log P(w_i | f(s; θ)) ] in Eqn.(2), by using Monte Carlo estimation. In addition, we can drop the dependence on the variable Y in the variational distribution Q(s|d, Y; ϕ) since we may not have label information available for new documents.
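Continuing the sketch above (our own naming, and one common reading of Eqn.(3)-(4) in which each of the L labels is an independent Bernoulli given s), the supervised variant only needs a second decoding head and an extra term in the loss.

```python
class VDSH_S(VDSH):
    """VDSH-S sketch: words and labels are decoded from the same sample of s."""
    def __init__(self, vocab_size, num_labels, hidden_dim=1000, latent_dim=32):
        super().__init__(vocab_size, hidden_dim, latent_dim)
        self.label_decoder = nn.Linear(latent_dim, num_labels)  # plays the role of f(s; tau)

    def forward(self, d):
        t2 = self.encoder(d)
        mu, logvar = self.to_mu(t2), self.to_logvar(t2)
        s = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        log_probs = F.log_softmax(self.decoder(s), dim=-1)      # word likelihood, Eqn.(1)
        label_logits = self.label_decoder(s)                    # label scores, Eqn.(3)
        return log_probs, label_logits, mu, logvar

def supervised_loss(log_probs, label_logits, d, Y, mu, logvar):
    """Negative of the lowerbound in Eqn.(4)."""
    recon = -(d * log_probs).sum(dim=-1)
    label_nll = F.binary_cross_entropy_with_logits(
        label_logits, Y.float(), reduction="none").sum(dim=-1)  # -sum_j log P(y_j | f(s; tau))
    kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(dim=-1)
    return (recon + label_nll + kl).mean()
```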
The architecture of the VDSH-S model is shown in Figure 1(b). It consists of a feedforward neural network encoder of a document, d → s, and a decoder of the words and labels of the document, s → {w_i}_{i=1}^{N}; {y_j}_{j=1}^{L}. It is worth pointing out that the labels still affect the learning of the latent semantic vector through their presence in the decoder, despite their absence in the encoder. By using the reparameterization trick, the model becomes a deterministic deep neural network and the lowerbound in Eqn.(4) can be maximized by backpropagation (see Appendix).

3.3 Document-specific Modeling (VDSH-SP)
VDSH-S assumes both documents and labels are generated by the same latent semantic vector s. In some cases, this assumption may be restrictive. For example, the original document may contain information that is irrelevant to the labels, and it could be difficult to find a common representation for both documents and labels. This observation motivates us to introduce a document private variable v, which is not shared by the labels Y. The generative process is described as follows:

• For each document d,
  – Draw a latent semantic vector s ∼ P(s), where P(s) = N(0, I) is the standard Gaussian distribution.
  – Draw a latent private vector v ∼ P(v), where P(v) = N(0, I) is the standard Gaussian distribution.
  – For the i-th word in the document,
    ∗ Draw w_i ∼ P(w_i | f(s + v; θ)).
  – For the j-th label in the label set,
    ∗ Draw y_j ∼ P(y | f(s; τ)).

As we can see, s models the information shared between the document and its labels, while v only contains document-specific information. We can view adding private variables as removing the noise from the original content that is irrelevant to the labels. With the added private variable, we denote this model as VDSH-SP. The tractable variational lowerbound of the data likelihood can be derived as follows:

L_3 = E_Q[ Σ_{i=1}^{N} log P(w_i | f(s + v; θ)) + Σ_{j=1}^{L} log P(y_j | f(s; τ)) ] − D_KL(Q(s|d; ϕ) ‖ P(s)) − D_KL(Q(v|d; ϕ) ‖ P(v))    (5)

Similar to the other two models, VDSH-SP can be viewed as a deep neural network by applying variational inference and reparametrization. The architecture is shown in Figure 1(c). The Appendix contains the detailed derivations of the model.
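A corresponding sketch for VDSH-SP (again our own construction, not the authors' code): a second Gaussian head produces the private vector v, the word decoder reads s + v while the label decoder reads s alone, and the loss of Eqn.(5) gains a second KL term computed exactly like the first.

```python
class VDSH_SP(VDSH_S):
    """VDSH-SP sketch: a private vector v augments the word decoder only."""
    def __init__(self, vocab_size, num_labels, hidden_dim=1000, latent_dim=32):
        super().__init__(vocab_size, num_labels, hidden_dim, latent_dim)
        self.to_mu_v = nn.Linear(hidden_dim, latent_dim)
        self.to_logvar_v = nn.Linear(hidden_dim, latent_dim)

    def forward(self, d):
        t2 = self.encoder(d)
        mu_s, logvar_s = self.to_mu(t2), self.to_logvar(t2)
        mu_v, logvar_v = self.to_mu_v(t2), self.to_logvar_v(t2)
        s = mu_s + torch.exp(0.5 * logvar_s) * torch.randn_like(mu_s)
        v = mu_v + torch.exp(0.5 * logvar_v) * torch.randn_like(mu_v)
        log_probs = F.log_softmax(self.decoder(s + v), dim=-1)  # words see s + v
        label_logits = self.label_decoder(s)                    # labels see s only
        # Eqn.(5) adds a KL term for v, computed exactly like the KL term for s.
        return log_probs, label_logits, (mu_s, logvar_s), (mu_v, logvar_v)
```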
3.4 Thresholding
[…] partitioning of the whole dataset [38]. Thus, we set the threshold for binarizing the p-th bit to be the median of the p-th dimension of s in the training data. If the p-th dimension of a document's latent semantic vector µ_new is larger than the median, the p-th binary code is set to 1; otherwise, it is set to -1. Another popular thresholding method is to apply the Sign function to µ_new: if the p-th dimension of µ_new is nonnegative, the corresponding code is 1; otherwise, it is -1. Since the prior distribution of the latent semantic vector has zero mean, the Sign function is also a reasonable choice. We use median thresholding as the default method in our experiments and also investigate the Sign function in Section 5.3.
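Both rules are one-liners over the encoder's posterior means; the numpy sketch below is purely illustrative and the array names are ours.

```python
import numpy as np

def binarize_median(train_mu, query_mu):
    """Median thresholding: +1 where a dimension exceeds its training median, else -1."""
    medians = np.median(train_mu, axis=0)          # one threshold per latent dimension
    return np.where(query_mu > medians, 1, -1)

def binarize_sign(query_mu):
    """Sign thresholding: +1 for nonnegative dimensions, else -1."""
    return np.where(query_mu >= 0, 1, -1)

# train_mu and query_mu would be the posterior means produced by the encoder,
# e.g. arrays of shape (n_train, K) and (n_query, K).
```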
3.5 Discussions
The computational complexity of VDSH for a training document is O(BD² + DSV). Here, O(BD²) is the cost of the encoder, where B is the number of layers in the encoder network and D is the average dimension of these layers; O(DSV) is the cost of the decoder, where S is the average length of the documents and V is the vocabulary size. The computational complexity of VDSH-S and VDSH-SP is O(BD² + DS(V + L)), where L is the size of the label set. The computational cost of the proposed models is thus at the same level as that of a deterministic autoencoder. Model learning can be quite efficient since the computations of all the models can be parallelized on GPUs, and only one sample is required during the training process.

The proposed deep generative models have a few desirable properties for text hashing. First of all, they have the capacity of deep neural networks to learn sophisticated semantic representations of text documents. Moreover, being generative models brings advantages over other deep learning models such as Convolutional Neural Networks (CNNs), because the underlying document generative process makes the model assumptions explicit. For example, as shown in [38], it is desirable to have independent feature dimensions in hash codes. To achieve this, our models only need to assume that the latent semantic vector is drawn from a prior distribution with independent dimensions (e.g., a standard Gaussian). The probabilistic approach also provides a principled framework for model extensions, as evident in VDSH-S and VDSH-SP. Furthermore, instead of learning a particular latent semantic vector, our models learn probability distributions of the semantic vector. This can be viewed as finding a region instead of a fixed point in the latent space for document representation, which leads to more robust models. Compared with other deep generative models such as stacked RBMs and GANs, our models are computationally tractable and stable and can be estimated by the efficient backpropagation algorithm.

RCV1 (8 / 16 / 32 / 64 / 128 bits) | Reuters (8 / 16 / 32 / 64 / 128 bits)
LSH [2]            0.4180 0.4352 0.4716 0.5214 0.5877 | 0.2802 0.3215 0.3862 0.4667 0.5194
SpH [38]           0.5093 0.7121 0.7475 0.7559 0.7423 | 0.6080 0.6340 0.6513 0.6290 0.6045
STH [41]           0.3975 0.4898 0.5592 0.5945 0.5946 | 0.6616 0.7351 0.7554 0.7350 0.6986
Stacked RBMs [28]  0.5106 0.5743 0.6130 0.6463 0.6531 | 0.5113 0.5740 0.6154 0.6177 0.6452
KSH [20]           0.9126 0.9146 0.9221 0.9333 0.9350 | 0.7840 0.8376 0.8480 0.8537 0.8620
SHTTM [36]         0.8820 0.9038 0.9258 0.9459 0.9447 | 0.7992 0.8520 0.8323 0.8271 0.8150
VDSH               0.7976 0.7944 0.8481 0.8951 0.8444 | 0.6859 0.7165 0.7753 0.7456 0.7318
VDSH-S             0.9652† 0.9749† 0.9801† 0.9804† 0.9800† | 0.9005† 0.9121† 0.9337† 0.9407† 0.9299†
VDSH-SP            0.9666† 0.9757† 0.9788† 0.9805† 0.9794† | 0.8890† 0.9326† 0.9283† 0.9286† 0.9395†

20Newsgroups (8 / 16 / 32 / 64 / 128 bits) | TMC (8 / 16 / 32 / 64 / 128 bits)
LSH [2]            0.0578 0.0597 0.0666 0.0770 0.0949 | 0.4388 0.4393 0.4514 0.4553 0.4773
SpH [38]           0.2545 0.3200 0.3709 0.3196 0.2716 | 0.5807 0.6055 0.6281 0.6143 0.5891
STH [41]           0.3664 0.5237 0.5860 0.5806 0.5443 | 0.3723 0.3947 0.4105 0.4181 0.4123
Stacked RBMs [28]  0.0594 0.0604 0.0533 0.0623 0.0642 | 0.4846 0.5108 0.5166 0.5190 0.5137
KSH [20]           0.4257 0.5559 0.6103 0.6488 0.6638 | 0.6608 0.6842 0.7047 0.7175 0.7243
SHTTM [36]         0.2690 0.3235 0.2357 0.1411 0.1299 | 0.6299 0.6571 0.6485 0.6893 0.6474
VDSH               0.3643 0.3904 0.4327 0.1731 0.0522 | 0.4330 0.6853 0.7108 0.4410 0.5847
VDSH-S             0.6586† 0.6791† 0.7564† 0.6850† 0.6916† | 0.7387† 0.7887† 0.7883† 0.7967† 0.8018†
VDSH-SP            0.6609† 0.6551† 0.7125† 0.7045† 0.7117† | 0.7498† 0.7798† 0.7891† 0.7888† 0.7970†

Table 1: Precision of the top 100 retrieved documents on four datasets with different numbers of hashing bits. The bold font denotes the best result at that number of bits. † denotes that the improvement over the best result of the baselines is statistically significant based on the paired t-test (p-value < 0.01).
4 EXPERIMENTAL SETUP

4.1 Datasets
[…] 2) Reuters21578², a text corpus for text classification. This collection has 10,788 documents with 90 categories and 7,164 unique words. 3) 20Newsgroups³. This dataset is a collection of 18,828 newsgroup posts, partitioned (nearly) evenly across 20 different newsgroups/categories. It has become a popular dataset for experiments in text applications of machine learning techniques. 4) TMC⁴. This dataset contains the air traffic reports provided by NASA and was used as part of the SIAM text mining competition. It has 22 labels, 21,519 training documents, 3,498 test documents, and 3,498 documents for the validation set. All the datasets are multi-label except 20Newsgroups.

Each dataset was split into three subsets with roughly 80% for training, 10% for validation, and 10% for test. The training data is used to learn the mapping from document to hash code. Each document in the test set is used to retrieve similar documents based on the mapping, and the results are evaluated. The validation set is used to choose the hyperparameters. We removed the stopwords using SMART's list of 571 stopwords⁵. No stemming was performed. We use TFIDF [23] as the default term weighting scheme for the raw document representation (i.e., d). We experiment with other term weighting schemes in Section 5.4.
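For illustration, a TFIDF bag-of-words representation like d can be built with scikit-learn as sketched below. The paper removes SMART stopwords and applies no stemming; here scikit-learn's built-in English stopword list and a toy corpus stand in for that pipeline, so treat the details as assumptions rather than the authors' preprocessing.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "nhl trade rumor montreal ottawa",   # placeholder documents
    "books for sale in ann arbor",
]
vectorizer = TfidfVectorizer(stop_words="english")
d = vectorizer.fit_transform(docs).toarray()   # each row is the TFIDF vector fed to the encoder
vocab_size = d.shape[1]
```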
4.2 Baselines and Settings
We compare the proposed models with the following six competitive baselines, which have been extensively used for text hashing in the prior work [36]: Locality Sensitive Hashing (LSH)⁶ [2], Spectral Hashing (SpH)⁷ [38], Self-Taught Hashing (STH)⁸ [41], Stacked Restricted Boltzmann Machines (Stacked RBMs) [28], Supervised Hashing with Kernels (KSH) [20], and Semantic Hashing using Tags and Topic Modeling (SHTTM) [36]. We used the validation dataset to choose the hyperparameters for the baselines.

For our proposed models, we adopt the method in [6] for weight initialization. The Adam optimizer [12] with a step size of 0.001 is used due to its fast convergence. Following the practice in [37], we use the dropout technique [30] with a keep probability of 0.8 during training to alleviate overfitting. The number of hidden nodes of the models is 1,500 for RCV1 and 1,000 for the other three smaller datasets. All the experiments were conducted on a server with 2 Intel E5-2630 CPUs and 4 GeForce GTX TITAN X GPUs. The proposed deep models were implemented on the TensorFlow⁹ platform. For the VDSH model on the Reuters21578, 20Newsgroups, and TMC datasets, each epoch takes about 60 seconds and each run takes 30 epochs to converge. For RCV1, it takes about 3,600 seconds per epoch and needs fewer epochs (about 15) to reach satisfactory performance. Since RCV1 is much larger than the other three datasets, this shows that the proposed models are quite scalable. VDSH-S and VDSH-SP take slightly more time to train than VDSH does (about 40 minutes each on Reuters21578, 20Newsgroups, and TMC, and 20 hours on RCV1).

² http://www.nltk.org/book/ch02.html
³ http://ana.cachopo.org/datasets-for-single-label-text-categorization
⁴ https://catalog.data.gov/dataset/siam-2007-text-mining-competition-dataset
⁵ http://www.lextek.com/manuals/onix/stopwords2.html
⁶ http://pixelogik.github.io/NearPy/
⁷ http://www.cs.huji.ac.il/~yweiss/SpectralHashing/
⁸ http://www.dcs.bbk.ac.uk/~dell/publications/dellzhang_sigir2010/sth_v1.zip
⁹ https://www.tensorflow.org/
Figure 2: The precision within the Hamming distance of 2 on four datasets (RCV1, Reuters, 20Newsgroups, TMC) with different numbers of hashing bits.
4.3 Evaluation Metrics
To evaluate the effectiveness of hashing codes in similarity search, each document in the test set is used as a query document to search for similar documents based on the Hamming distance (i.e., the number of differing bits) between their hashing codes. Following the prior work in text hashing [36], the performance is measured by Precision, the ratio of the number of retrieved relevant documents to the number of all retrieved documents. The results are averaged over all the test documents.

There exist various ways to determine whether a retrieved document is relevant to the given query document. In SpH [38], the K closest documents in the original feature space are considered the relevant documents. This metric is not desirable since similarity in the original feature space may not well reflect document semantic similarity. Also, it is hard to determine a suitable K for the cutoff threshold. Instead, we adopt the methodology used in KSH [32], SHTTM [36], and other prior work [32]: a retrieved document that shares any common test label with the query document is regarded as a relevant document.
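Put as code, the metric ranks the corpus by Hamming distance and checks label overlap in the top 100; the sketch below is our own illustration with hypothetical array names.

```python
import numpy as np

def precision_at_k(query_code, query_labels, doc_codes, doc_labels, k=100):
    """Precision@k for one query under the shared-label relevance criterion.

    query_code: (n_bytes,) packed uint8 hash code of the query document
    doc_codes:  (n_docs, n_bytes) packed uint8 codes of the corpus
    *_labels:   {0,1} label-indicator arrays (multi-label)
    """
    dist = np.unpackbits(np.bitwise_xor(doc_codes, query_code), axis=1).sum(axis=1)
    top_k = np.argsort(dist, kind="stable")[:k]
    relevant = (doc_labels[top_k] @ query_labels) > 0    # shares at least one label
    return relevant.mean()
```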
5 EXPERIMENTAL RESULTS

5.1 Baseline Comparison
Table 1 shows the results of the different methods over various numbers of bits on the four testbeds. We have several observations from the results. First of all, the best results at different bits are all achieved by VDSH-S or VDSH-SP. They consistently yield better results than all the baselines across all the different numbers of bits, and all the improvements over the baselines are statistically significant based on the paired t-test (p-value < 0.01). VDSH-S and VDSH-SP produce comparable results. Adding private variables does not always help, because it increases the model flexibility, which may lead to overfitting to the training data. This probably explains why VDSH-SP generally yields better performance when the number of bits is 8, which corresponds to a simpler model.

Secondly, the supervised hashing techniques (i.e., VDSH-S, VDSH-SP, KSH) outperform the unsupervised methods on the four datasets across all the different bits. These results demonstrate the importance of utilizing supervisory signals for text hashing. However, the unsupervised model STH outperforms SHTTM on the original 20-category Newsgroups. One possible explanation is that SHTTM depends on LDA to learn an initial representation, but many categories in Newsgroups are correlated, so LDA could assign similar topics to documents from related categories (e.g., Christian and Religion). Hence SHTTM may not effectively distinguish two related categories. Evidently, SHTTM and KSH deliver comparable results except on the 20Newsgroups testbed. It is worth noting that there exist substantial gaps between the supervised and unsupervised proposed models (VDSH-S and VDSH-SP vs. VDSH) across all the datasets and configurations. The label information seems remarkably useful for guiding the deep generative models to learn effective representations. This is probably due to the high capacity of the neural network component, which can learn subtle patterns from supervisory signals when available.

Thirdly, the performance does not always improve when the number of bits increases. This pattern is quite consistent across all the compared methods and is likely the result of model overfitting, which suggests that using a long hash code is not always helpful, especially when training data is limited. Last but not least, the testbeds may affect the model performance. All the best results are obtained on the RCV1 dataset, whose size is much larger than the other testbeds. These results illustrate the importance of using a large amount of data to train text hashing models.

It is worth noting that some of the baseline results are different from what was reported in the prior work. This is due to the data preprocessing. For example, [36] combined some categories in 20Newsgroups to form 6 broader categories in their experiments, while we use all the original 20 categories for evaluation. [41] focused on single-label documents by discarding the documents appearing in more than one category, while we use all the documents in the corpus.

5.2 Retrieval with Fixed Hamming Distance
In practice, IR systems may retrieve similar documents in a large corpus within a fixed Hamming distance radius of the query document. In this section, we evaluate the precision for the Hamming radius of 2. Figure 2 shows the results on the four datasets with different numbers of hashing bits. We can see that the overall best performance among all nine hashing methods on each dataset is achieved by either VDSH-S or VDSH-SP at 16 bits. In general, the precision of most of the methods decreases when the number of hashing bits increases from 32 to 128. This may be due to the
fact that when using longer hashing bits, the Hamming space becomes increasingly sparse and very few documents fall within the Hamming distance of 2, resulting in more queries with precision 0. Similar behavior is also observed in prior work such as KSH [20] and SHTTM [36]. A notable exception is Stacked RBMs, whose performance is quite stable across different numbers of bits while lagging behind the best performers.

5.3 Effect of Thresholding
Thresholding is an important step in hashing to transform a continuous document representation into a binary code. We investigate two popular thresholding functions, Median and Sign, which are introduced in Section 3.4. Table 2 contains the precision results of the proposed models with the 32-bit hash code on the four datasets. As we can see, the two thresholding functions generate quite similar results and their differences are not statistically significant, which indicates that all the proposed models, whether unsupervised or supervised, are not sensitive to the thresholding method.

[Table 2: Precision of VDSH, VDSH-S, and VDSH-SP with the 32-bit hash code under Median and Sign thresholding; the numeric entries did not survive extraction.]

5.4 Effect of Term Weighting Schemes
In this section, we investigate the effect of term weighting schemes on the performance of the proposed models. Different term weights result in different bag-of-words representations of d as the input to the neural network. Specifically, we experiment with three term weighting representations for documents: Binary, Term Frequency (TF), and Term Frequency-Inverse Document Frequency (TFIDF) [23]. Figure 3 illustrates the results of the proposed models with the 32-bit hash code on the four datasets. As we can see, the proposed models generally are not very sensitive to the underlying term weighting schemes. The TFIDF weighting always gives the best performance on all four datasets. The improvement is more noticeable with VDSH-S and VDSH-SP on 20Newsgroups. The results indicate that more sophisticated weighting schemes may capture more information about the original documents and thus lead to better hashing results. On the other hand, all three models yield quite stable results on RCV1, which suggests that a large-scale dataset may help alleviate the shortcomings of the basic term weighting schemes.

Figure 3: Precision@100 for different term weighting schemes on the proposed models with the 32-bit hash code.

5.5 Qualitative Analysis
In this section, we visualize the low-dimensional representations of the documents and examine whether they preserve the semantics of the original documents. Specifically, we use t-SNE¹⁰ [22] to generate scatter plots for the document latent semantic vectors in 32-dimensional space obtained by SHTTM and VDSH-S on the 20Newsgroups dataset. Figure 4 shows the results. Here, each data point represents a document which is associated with one of the 20 categories, and different colors represent different categories based on the ground truth.

As we can see in Figure 4(b), VDSH-S generates well-separated clusters, each corresponding to a true category (each number in the plot represents a category ID). On the other hand, the clustering structure from SHTTM shown in Figure 4(a) is much less evident and recognizable. Some nearby clusters in Figure 4(b) are also semantically related, e.g., Category 7 (Religion) and Category 11 (Atheism); Category 20 (Middle East) and Category 10 (Guns); Category 8 (WinX) and Category 5 (Graphics).

We further sampled some documents from the dataset to see where they are placed in the plots. Table 3 contains the DocIDs, categories, and subjects of the sample documents. Doc5780 discusses a trade rumor in the NHL and Doc5773 is about NHL team leaders. Both documents belong to the category Hockey and should be close to each other, which can be clearly observed in Figure 4(b) for VDSH-S. However, these two documents are projected far away from each other by SHTTM, as shown in Figure 4(a). For another random pair of documents, Doc3420 and Doc3412, VDSH-S also puts them much closer to each other than SHTTM does. These results demonstrate the great effectiveness of VDSH-S in learning low-dimensional representations for text documents.

¹⁰ https://lvdmaaten.github.io/tsne/
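The visualization above can be reproduced with scikit-learn's t-SNE; the sketch below uses random stand-ins for the 32-dimensional latent vectors and category IDs, so it only illustrates the plotting recipe, not the paper's actual data.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

latent = np.random.randn(500, 32)            # stand-in for the encoder's posterior means
labels = np.random.randint(0, 20, size=500)  # stand-in for the 20 category IDs

coords = TSNE(n_components=2, random_state=0).fit_transform(latent)
plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=5, cmap="tab20")
plt.show()
```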
Figure 4: Visualization of the 32-dimensional document latent semantic vectors by SHTTM and VDSH-S on the 20Newsgroups dataset using t-SNE. Each point represents a document and different colors denote different categories based on the ground truth. In (b) VDSH-S, each number is a category ID; the corresponding categories are: 1: Biker, 2: MAC, 3: Politics, 4: Christian, 5: Graphics, 6: Medicines, 7: Religion, 8: WinX, 9: IBM, 10: Guns, 11: Atheism, 12: MS, 13: Crypt, 14: Space, 15: ForSale, 16: Hockey, 17: Baseball, 18: Electronics, 19: Autos, 20: MidEast.
DocId    Category  Title/Subject
Doc5780  Hockey    Trade rumor: Montreal/Ottawa/Phillie
Doc5773  Hockey    NHL team leaders in +/-
Doc3420  ForSale   Books For Sale [Ann Arbor, MI]
Doc3412  ForSale   *** NeXTstation 8/105 For Sale ***
Table 3: The titles of the four sample documents in Figure 4.

[…] for encoder and decoder. These more sophisticated models may be able to capture the local relations (by CNN) or sequential information (by RNN, NADE, MADE) in text. Moreover, we will utilize the probabilistic generative process to sample and simulate new text, which may facilitate the task of Natural Language Generation [26]. Last but not least, we will adapt the proposed models to hash other types of data such as images and videos.