VDSH
ABSTRACT
As the amount of textual data has been rapidly increasing over the past decade, efficient similarity search methods have become a crucial component of large-scale information retrieval systems. A popular strategy is to represent the original data samples by compact binary codes through hashing. A spectrum of machine learning methods has been utilized for this purpose, but these methods often lack the expressiveness and flexibility in modeling needed to learn effective representations. The recent advances of deep learning in a wide range of applications have demonstrated its capability to learn robust and powerful feature representations for complex data. In particular, deep generative models naturally combine the expressiveness of probabilistic generative models with the high capacity of deep neural networks, which is very suitable for text modeling. However, little work has leveraged the recent progress in deep learning for text hashing. In this paper, we propose a series of novel deep document generative models for text hashing. The first proposed model is unsupervised, while the second one is supervised by utilizing document labels/tags for hashing. The third model further considers document-specific factors that affect the generation of words. The probabilistic generative formulation of the proposed models provides a principled framework for model extension, uncertainty estimation, simulation, and interpretability. Based on variational inference and reparameterization, the proposed models can be interpreted as encoder-decoder deep neural networks and are thus capable of learning complex nonlinear distributed representations of the original documents. We conduct a comprehensive set of experiments on four public testbeds. The experimental results demonstrate the effectiveness of the proposed supervised learning models for text hashing.

CCS CONCEPTS
• Information systems → Information retrieval; • Computing methodologies → Neural networks; Learning latent representations;

KEYWORDS
Semantic hashing; Variational autoencoder; Deep learning

1 INTRODUCTION
The task of similarity search, also known as nearest neighbor search, proximity search, or close item search, is to find similar items given a query object [35]. It has many important information retrieval applications such as document clustering, content-based retrieval, and collaborative filtering [33]. The rapid growth of the Internet has resulted in massive textual data in recent decades. In addition to the cost of storage, searching for relevant content in gigantic databases is even more daunting. Traditional text similarity computations are conducted in the original vector space and can be prohibitive for large-scale corpora, since these methods incur a high cost of numerical computation in high-dimensional spaces.

Many research efforts have been devoted to approximate similarity search, which has been shown to be useful for practical problems. Hashing [5, 28, 38] is an effective solution that accelerates similarity search by designing compact binary codes in a low-dimensional space so that semantically similar documents are mapped to similar codes. This approach is much more efficient in memory and computation. A binary representation of each document often needs only 4 or 8 bytes to store, so a large number of encoded documents can be loaded directly into main memory. Computing the similarity between two documents can be accomplished with a bitwise XOR operation, which takes only one CPU instruction. A spectrum of machine learning methods has been utilized in hashing, but they often lack expressiveness and flexibility in modeling, which prevents them from learning compact and effective representations of text documents.
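To make the storage and lookup argument concrete, the following minimal Python sketch (our own illustration, not code from the paper) stores 32-bit codes as packed bytes and ranks documents by the number of differing bits computed through XOR; all names and sizes are hypothetical.

```python
import numpy as np

def hamming_distances(codes, query_code):
    """Hamming distances between one query code and a matrix of packed codes.

    codes:      (n_docs, n_bytes) uint8 array, each row a packed binary hash code
    query_code: (n_bytes,) uint8 array for the query document
    """
    # XOR marks the bits that differ; unpackbits + sum counts them per document.
    diff = np.bitwise_xor(codes, query_code)
    return np.unpackbits(diff, axis=1).sum(axis=1)

# 32-bit codes take 4 bytes per document, so one million documents fit in ~4 MB.
rng = np.random.default_rng(0)
corpus_codes = rng.integers(0, 256, size=(1_000_000, 4), dtype=np.uint8)
query_code = rng.integers(0, 256, size=4, dtype=np.uint8)
top100 = np.argsort(hamming_distances(corpus_codes, query_code))[:100]
```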
On the other hand, deep learning has made tremendous progress in the past decade and has demonstrated impressive successes in a variety of domains including speech recognition, computer vision, and natural language processing [18]. One of the main purposes of deep learning is to learn robust and powerful feature representations for complex data. Recently, deep generative models with variational inference [14, 27] have further boosted the expressiveness and flexibility of representation learning by integrating deep neural nets into the probabilistic generative framework. The seamless combination of generative modeling and deep learning makes them suitable for text hashing. However, to the best of our knowledge, no prior work has leveraged them for hashing tasks.

In this paper, we propose a series of novel deep document generative models for text hashing, inspired by the variational autoencoder (VAE) [14, 27]. The proposed models are the marriage of deep learning and probabilistic generative models [1], and they enjoy the good properties of both learning paradigms. First, with deep neural networks, the proposed models can learn flexible nonlinear distributed representations of the original high-dimensional documents. This allows individual codes to be fairly general and
concise but their intersection to be much more precise. For example, nonlinear distributed representations allow the topics/codes "government," "mafia," and "playboy" to combine to give very high probability to the word "Berlusconi," which is not predicted nearly as strongly by each topic/code alone.

Meanwhile, the proposed models are probabilistic generative models, and thus there exists an underlying data generation process characterizing each model. The probabilistic generative formulation provides a principled framework for model extensions such as incorporating supervisory signals and adding private variables. The first proposed model is unsupervised and can be interpreted as a variant of the variational autoencoder for text documents. The other two models are supervised by utilizing the document label/tag information. Prior work in the literature [36] has demonstrated that supervisory signals are crucial to boost the performance of semantic hashing for text documents. The third model further adds a private latent variable for documents to capture the information only concerned with the documents but irrelevant to the labels, which may help remove noise from document representations. Furthermore, specific constraints can be enforced by making explicit assumptions in the models. One desirable property of a hash code is that its bits are uncorrelated, so that the next bit cannot be predicted based on the previous bits [38]. To achieve this property, we can simply assume that the latent variable has a prior distribution with independent dimensions.

In sum, the probabilistic generative formulation provides a principled framework for model extensions, interpretability, uncertainty estimation, and simulation, which are often lacking in deep learning models but useful in text hashing. The main contributions of the paper can be summarized as follows:

• We proposed a series of unsupervised and supervised deep document generative models to learn compact representations for text documents. To the best of our knowledge, this is the first work that utilizes deep generative models with variational inference for text hashing.
• The proposed models enjoy the advantages of both deep learning and probabilistic generative models. They can learn complex nonlinear distributed representations of the original high-dimensional documents while providing a principled framework for probabilistic reasoning.
• We derived tractable variational lowerbounds for the proposed models and reparameterized the models so that backpropagation can be applied for efficient parameter estimation.
• We conducted a comprehensive set of experiments on four public testbeds. The experimental results demonstrate significant improvements of our supervised models over several well-known semantic hashing baselines.

2 RELATED WORK

2.1 Hashing
Due to the computational and storage efficiency of compact binary codes, hashing methods have been widely used for similarity search, which is an essential component in a variety of large-scale information retrieval systems [33, 35]. Locality-Sensitive Hashing (LSH) [2] is one of the most popular hashing methods, with interesting asymptotic theoretical properties leading to performance guarantees. While LSH is a data-independent hashing method, many hashing methods have recently been proposed to leverage machine learning techniques with the goal of learning data-dependent hash functions, ranging from unsupervised and supervised to semi-supervised settings. Unsupervised hashing methods attempt to integrate data properties, such as distributions and manifold structures, to design compact hash codes with improved accuracy. For instance, Spectral Hashing (SpH) [38] explores the data distribution by preserving the similarity between documents while forcing balanced and uncorrelated constraints on the learned codes, which can be viewed as an extension of spectral clustering [25]. Graph hashing [21] utilizes the underlying manifold structure of the data captured by a graph representation. Self-Taught Hashing (STH) [41] is a state-of-the-art hashing method that decomposes the learning procedure into two steps: generating binary codes and learning the hash function.

Supervised hashing methods attempt to leverage label/tag information for hash function learning and have attracted more and more attention in recent years. For example, Wang et al. [36] propose Semantic Hashing using Tags and Topic Modeling (SHTTM) to incorporate tags and obtain more effective hashing codes via a matrix factorization formulation. To utilize pairwise supervision information in hash function learning, Kernel-Based Supervised Hashing (KSH) [20] uses pairwise relationships between samples to achieve high-quality hashing. Binary Reconstructive Embedding (BRE) [15] was proposed to learn hash functions by minimizing the reconstruction error between the metric space and the Hamming space. Moreover, there are also several works using ranking order information to design hash functions. Ranking-based Supervised Hashing (RSH) [34] was proposed to leverage listwise supervision in the hash function learning framework. The semi-supervised learning paradigm has also been employed to design hash functions using both labeled and unlabeled data [32]. The hashing-code learning problem is essentially a discrete optimization problem, which is difficult to solve. Most existing supervised hashing methods solve a relaxed continuous optimization problem and then threshold the continuous representation to obtain a binary code. Abundant related work, especially on image hashing, exists in the literature. Two recent surveys [33, 35] provide a comprehensive literature review.

2.2 Deep Learning
Deep learning has drawn increasing attention and research effort in a variety of artificial intelligence areas including speech recognition, computer vision, and natural language processing. Since one main purpose of deep learning is to learn robust and powerful feature representations for complex data, it is natural to leverage deep learning to explore compact hash codes, which can be regarded as binary representations of data. Most of the related work has focused on image data [4, 16, 19, 39] rather than text documents, probably due to the effectiveness of convolutional neural networks (CNNs) at learning good low-dimensional representations of images. The typical deep learning architectures for hash function learning consist of CNN layers for representation learning and hash function layers which then transform the representation
to supervisory signals. The loss functions can be pointwise [19], pairwise [4], or listwise [16].

Some recent works have applied deep learning to several IR tasks such as ad-hoc retrieval [9], web search [11], and ranking pairs of short texts [29]. However, very few have investigated deep learning for text hashing. The representative work is semantic hashing [28]. It builds a stack of restricted Boltzmann machines (RBMs) [10] to discover hidden binary units which can model input text data (i.e., word-count vectors). After learning a multilayer RBM through pretraining and fine-tuning on a collection of documents, the hash code of any document is acquired by simply thresholding the output of the deepest layer. A recent work [40] exploited convolutional neural networks for text hashing, which relies on external features such as the GloVe word embeddings to construct text representations.

Recently, deep generative models have made impressive progress with the introduction of variational autoencoders (VAEs) [14, 27] and Generative Adversarial Networks (GANs) [8]. VAEs are an especially appealing framework for generative modeling because they couple the approach of variational inference [31] with deep learning. As a result, they enjoy the advantages of both deep learning and probabilistic graphical models. Deep generative models parameterized by neural networks have achieved state-of-the-art performance in unsupervised and supervised learning [13, 14, 24]. To the best of our knowledge, our proposed models are the first work that utilizes variational inference with deep learning for text hashing. It is worth pointing out that both semantic hashing with stacked RBMs [28] and our models are deep generative models, but the former are undirected graphical models while the latter are directed models. The underlying generative process of directed probabilistic models makes them easy to interpret and extend. The proposed models are also very scalable since they are trained as deep neural networks by efficient backpropagation, while stacked RBMs are often much harder to train [10].

3 VARIATIONAL DEEP SEMANTIC HASHING
This section presents three novel deep document generative models to learn low-dimensional semantic representations of documents for text hashing. In Section 3.1, we introduce the basic model, which is essentially a variational autoencoder for text modeling. Section 3.2 extends the model to utilize label information to learn a more sensible representation. Section 3.3 further incorporates document private variables to model document-specific information. Based on variational inference, all three models can be viewed as having an encoder-decoder neural network architecture, where the encoder compresses a high-dimensional document into a compact latent semantic vector and the decoder reconstructs the document (or the labels). Section 3.4 discusses two thresholding methods to convert the continuous latent vector into a binary code for text hashing.

3.1 Unsupervised Learning (VDSH)
In this section, we present the basic variational deep semantic hashing (VDSH) model for the unsupervised learning setting. VDSH is a probabilistic generative model of text which aims to extract a continuous low-dimensional semantic representation s ∈ R^K for each document. Let d ∈ R^V be the bag-of-words representation of a document and w_i ∈ {0, 1}^V be the one-hot vector representation of the i-th word of the document, where V is the vocabulary size. d can be represented by different term weighting schemes such as binary, TF, and TFIDF [23]. The document generative process can be described as follows:

• For each document d,
  – Draw a latent semantic vector s ∼ P(s), where P(s) = N(0, I) is the standard Gaussian distribution.
  – For the i-th word in the document,
    ∗ Draw w_i ∼ P(w_i | f(s; θ)).

The conditional probability over words w_i is modeled by multinomial logistic regression and shared across documents as below:

P(w_i | f(s; θ)) = exp(w_i^T f(s; θ)) / Σ_{j=1}^{V} exp(w_j^T f(s; θ))    (1)

While P(s) is a simple Gaussian distribution, any distribution can be generated by mapping the simple Gaussian through a sufficiently complicated function [3]. Thus, f(s; θ) is such a highly flexible function approximator, usually a neural network. In other words, we can learn a function which maps our independent, normally distributed s values to whatever latent semantic variables might be needed for the model, and then generate the word w_i. However, introducing a highly nonlinear mapping from s to w_i results in an intractable data likelihood ∫_s P(d|s)P(s) ds and thus an intractable posterior distribution P(s|d) [14]. Similar to the VAE, we use an approximation Q(s|d; ϕ) for the true posterior distribution. By applying the variational inference principle [31], we can obtain the following tractable lowerbound of the document log likelihood (see [14] and the Appendix):

L_1 = E_Q[ Σ_{i=1}^{N} log P(w_i | f(s; θ)) ] − D_KL(Q(s|d; ϕ) ‖ P(s))    (2)

where N is the number of words in the document and D_KL(· ‖ ·) is the Kullback-Leibler (KL) divergence between the approximate posterior distribution Q(s|d; ϕ) and the prior P(s). The variational distribution Q(s|d; ϕ) acts as a proxy to the true posterior P(s|d). To enable high capacity, it is assumed to be a Gaussian N(µ, diag(σ²)) whose mean µ and variance σ² are the output of a highly nonlinear function of d denoted as g(d; ϕ), parameterized by ϕ, once again typically a neural network.

In training, the variational lowerbound in Eqn.(2) is maximized with respect to the model parameters. Since P(s) is a standard Gaussian prior, the KL divergence D_KL(Q(s|d; ϕ) ‖ P(s)) in Eqn.(2) can be computed analytically. The first term E_Q[·] can be viewed as an expected negative reconstruction error of the words in the document, and it can be computed based on Monte Carlo estimation [7].
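Since Q(s|d; ϕ) is a diagonal Gaussian and P(s) is the standard normal, the KL term above has the usual closed form used for VAEs (a standard identity, see [14]; it is restated here rather than copied from the paper's appendix), with K the dimensionality of s:

$$
D_{KL}\big(\mathcal{N}(\mu,\operatorname{diag}(\sigma^2))\,\big\|\,\mathcal{N}(0,I)\big)
= \frac{1}{2}\sum_{k=1}^{K}\left(\mu_k^{2}+\sigma_k^{2}-\log\sigma_k^{2}-1\right)
$$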
Based on Eqn.(2), we can interpret VDSH as a variational autoencoder with discrete output: a feedforward neural network encoder Q(s|d; ϕ) compresses document representations into continuous hidden vectors, i.e., d → s, and a softmax decoder Π_{i=1}^{N} P(w_i | f(s; θ)) reconstructs the documents by independently generating the words, s → {w_i}_{i=1}^{N}.
Figure 1: Architectures of (a) VDSH, (b) VDSH-S, and (c) VDSH-SP. The dashed line represents a stochastic layer.
Figure 1(a) illustrates the architecture of VDSH. In the experiments, we use the following specific architecture for the encoder and decoder.

Encoder Q(s | g(d; ϕ)):
  t_1 = ReLU(W_1 d + b_1)
  t_2 = ReLU(W_2 t_1 + b_2)
  µ = W_3 t_2 + b_3
  log σ = W_4 t_2 + b_4
  s ∼ N(µ(d), diag(σ²(d)))

Decoder P(w_i | f(s; θ)):
  c_i = exp(−s^T G w_i + b_{w_i})
  P(w_i | s) = c_i / Σ_{k=1}^{V} c_k
  P(d | s) = Π_{i=1}^{N} P(w_i | s)

This architecture is similar to the one presented for the VAE [27], except that VDSH has a softmax layer to model discrete words while the VAE was proposed to model images as continuous output. Here, the encoder has two Rectified Linear Unit (ReLU) [7] layers. ReLU generally does not suffer from the vanishing gradient problem that affects other activation functions. Also, it has been shown that deep neural networks can be trained efficiently using ReLU even without pretraining [7].

In this architecture, there is a stochastic layer which samples s from the Gaussian distribution N(µ(d), diag(σ²(d))), as represented by the dashed lines in the middle of the networks in Figure 1. Backpropagation cannot handle a stochastic layer within the network. In practice, we can leverage the "location-scale" property of the Gaussian distribution and use the reparameterization trick [14] to turn the stochastic layer of s into a deterministic one. As a result, the encoder Q(s|d; ϕ) and decoder P(w_i | f(s; θ)) form an end-to-end neural network and are trained jointly by maximizing the variational lowerbound in Eqn.(2) with respect to their parameters using the standard backpropagation algorithm [7].
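To show how the pieces fit together, here is a minimal PyTorch sketch of the VDSH encoder-decoder with the reparameterization s = µ + σ ⊙ ε made explicit. It is an illustration under our own naming and simplifications (for instance, the decoder scores the whole vocabulary from the weighted bag-of-words d in one pass), not the authors' TensorFlow implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VDSH(nn.Module):
    """Minimal sketch of the unsupervised VDSH encoder-decoder."""
    def __init__(self, vocab_size, hidden_dim=1000, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(            # two ReLU layers: t1, t2
            nn.Linear(vocab_size, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        self.to_mu = nn.Linear(hidden_dim, latent_dim)       # mean head (W3 t2 + b3)
        self.to_logvar = nn.Linear(hidden_dim, latent_dim)   # log-variance head
        self.decoder = nn.Linear(latent_dim, vocab_size)     # plays the role of G and b_w

    def forward(self, d):
        t2 = self.encoder(d)
        mu, logvar = self.to_mu(t2), self.to_logvar(t2)
        # Reparameterization trick: s = mu + sigma * eps, with eps ~ N(0, I).
        s = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        log_probs = F.log_softmax(self.decoder(s), dim=-1)   # Eqn.(1) over the vocabulary
        return log_probs, mu, logvar

def elbo_loss(log_probs, d, mu, logvar):
    """Negative of the lowerbound in Eqn.(2), averaged over a batch; d holds
    (weighted) word counts, so the sum over word positions becomes a
    count-weighted sum over the vocabulary."""
    recon = -(d * log_probs).sum(dim=-1)                            # reconstruction term
    kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(dim=-1)  # analytic KL term
    return (recon + kl).mean()
```

Minimizing elbo_loss with any stochastic gradient method is then equivalent to maximizing Eqn.(2) with a single Monte Carlo sample per document.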
3.2 Supervised Learning (VDSH-S)
In many real-world applications, documents are often associated with labels or tags which may provide useful guidance in learning effective hashing codes. Document content similarity in the original bag-of-words space may not fully reflect the semantic relationship between documents. For example, two documents in the same category may have low document content similarity due to the vocabulary gap, while their semantic similarity could be high. In this section, we extend VDSH to the supervised setting with the new model, denoted VDSH-S. The probabilistic generative process of a document with labels is as follows:

• For each document d,
  – Draw a latent semantic vector s ∼ P(s), where P(s) = N(0, I) is the standard Gaussian distribution.
  – For the i-th word in the document,
    ∗ Draw w_i ∼ P(w_i | f(s; θ)).
  – For the j-th label in the label set,
    ∗ Draw y_j ∼ P(y | f(s; τ)).

where y_j ∈ {0, 1}^L is the one-hot representation of label j in the label set and L is the total number of possible labels (the size of the label set). Let Y ∈ {0, 1}^L represent the bag-of-labels of the document (i.e., if the document has label j, the j-th dimension of Y is 1; otherwise, it is 0). VDSH-S assumes that both words and labels are generated based on the same latent semantic vector.

We assume a general multi-label classification setting where each document can have multiple labels/tags. P(y_j | f(s; τ)) can be modeled by the logistic function as follows:

P(y_j | f(s; τ)) = 1 / (1 + exp(−y_j^T f(s; τ)))    (3)

Similar to VDSH, f(s; τ) is parameterized by a neural network with parameter τ so that we can learn an effective mapping from the latent semantic vector to the labels. The lowerbound of the data log likelihood can be similarly derived and is as follows:

L_2 = E_Q[ Σ_{i=1}^{N} log P(w_i | f(s; θ)) + Σ_{j=1}^{L} log P(y_j | f(s; τ)) ] − D_KL(Q(s|d, Y; ϕ) ‖ P(s))    (4)

Compared to Eqn.(2) in VDSH, this lowerbound has an extra term, E_Q[ Σ_{j=1}^{L} log P(y_j | f(s; τ)) ], which can be computed in a similar way to E_Q[ Σ_{i=1}^{N} log P(w_i | f(s; θ)) ] in Eqn.(2), by using Monte Carlo estimation. In addition, we can drop the dependence on the variable Y in the variational distribution Q(s|d, Y; ϕ) since we may not have label information available for new documents.
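Continuing the sketch above (our own naming, and one common reading of Eqn.(3)-(4) in which each of the L labels is an independent Bernoulli given s), the supervised variant only needs a second decoding head and an extra term in the loss.

```python
class VDSH_S(VDSH):
    """VDSH-S sketch: words and labels are decoded from the same sample of s."""
    def __init__(self, vocab_size, num_labels, hidden_dim=1000, latent_dim=32):
        super().__init__(vocab_size, hidden_dim, latent_dim)
        self.label_decoder = nn.Linear(latent_dim, num_labels)  # plays the role of f(s; tau)

    def forward(self, d):
        t2 = self.encoder(d)
        mu, logvar = self.to_mu(t2), self.to_logvar(t2)
        s = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        log_probs = F.log_softmax(self.decoder(s), dim=-1)      # word likelihood, Eqn.(1)
        label_logits = self.label_decoder(s)                    # label scores, Eqn.(3)
        return log_probs, label_logits, mu, logvar

def supervised_loss(log_probs, label_logits, d, Y, mu, logvar):
    """Negative of the lowerbound in Eqn.(4)."""
    recon = -(d * log_probs).sum(dim=-1)
    label_nll = F.binary_cross_entropy_with_logits(
        label_logits, Y.float(), reduction="none").sum(dim=-1)  # -sum_j log P(y_j | f(s; tau))
    kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(dim=-1)
    return (recon + label_nll + kl).mean()
```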
The architecture of the VDSH-S model is shown in Figure 1(b). It consists of a feedforward neural network encoder of a document, d → s, and a decoder of the words and labels of the document, s → {w_i}_{i=1}^{N}; {y_j}_{j=1}^{L}. It is worth pointing out that the labels still affect the learning of the latent semantic vector through their presence in the decoder, despite their absence in the encoder. By using the reparameterization trick, the model becomes a deterministic deep neural network and the lowerbound in Eqn.(4) can be maximized by backpropagation (see Appendix).

3.3 Document-specific Modeling (VDSH-SP)
VDSH-S assumes both documents and labels are generated by the same latent semantic vector s. In some cases, this assumption may be restrictive. For example, the original document may contain information that is irrelevant to the labels, and it could be difficult to find a common representation for both documents and labels. This observation motivates us to introduce a document private variable v, which is not shared by the labels Y. The generative process is described as follows:

• For each document d,
  – Draw a latent semantic vector s ∼ P(s), where P(s) = N(0, I) is the standard Gaussian distribution.
  – Draw a latent private vector v ∼ P(v), where P(v) = N(0, I) is the standard Gaussian distribution.
  – For the i-th word in the document,
    ∗ Draw w_i ∼ P(w_i | f(s + v; θ)).
  – For the j-th label in the label set,
    ∗ Draw y_j ∼ P(y | f(s; τ)).

As we can see, s models the information shared between the document and its labels, while v only contains document-specific information. We can view adding private variables as removing the noise from the original content that is irrelevant to the labels. With the added private variable, we denote this model as VDSH-SP. The tractable variational lowerbound of the data likelihood can be derived as follows:

L_3 = E_Q[ Σ_{i=1}^{N} log P(w_i | f(s + v; θ)) + Σ_{j=1}^{L} log P(y_j | f(s; τ)) ] − D_KL(Q(s|d; ϕ) ‖ P(s)) − D_KL(Q(v|d; ϕ) ‖ P(v))    (5)

Similar to the other two models, VDSH-SP can be viewed as a deep neural network by applying variational inference and reparametrization. The architecture is shown in Figure 1(c). The Appendix contains the detailed derivations of the model.
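A corresponding sketch for VDSH-SP (again our own construction, not the authors' code): a second Gaussian head produces the private vector v, the word decoder reads s + v while the label decoder reads s alone, and the loss of Eqn.(5) gains a second KL term computed exactly like the first.

```python
class VDSH_SP(VDSH_S):
    """VDSH-SP sketch: a private vector v augments the word decoder only."""
    def __init__(self, vocab_size, num_labels, hidden_dim=1000, latent_dim=32):
        super().__init__(vocab_size, num_labels, hidden_dim, latent_dim)
        self.to_mu_v = nn.Linear(hidden_dim, latent_dim)
        self.to_logvar_v = nn.Linear(hidden_dim, latent_dim)

    def forward(self, d):
        t2 = self.encoder(d)
        mu_s, logvar_s = self.to_mu(t2), self.to_logvar(t2)
        mu_v, logvar_v = self.to_mu_v(t2), self.to_logvar_v(t2)
        s = mu_s + torch.exp(0.5 * logvar_s) * torch.randn_like(mu_s)
        v = mu_v + torch.exp(0.5 * logvar_v) * torch.randn_like(mu_v)
        log_probs = F.log_softmax(self.decoder(s + v), dim=-1)  # words see s + v
        label_logits = self.label_decoder(s)                    # labels see s only
        # Eqn.(5) adds a KL term for v, computed exactly like the KL term for s.
        return log_probs, label_logits, (mu_s, logvar_s), (mu_v, logvar_v)
```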
3.4 Thresholding
[…] partitioning of the whole dataset [38]. Thus, we set the threshold for binarizing the p-th bit to be the median of the p-th dimension of s in the training data. If the p-th dimension of a document's latent semantic vector µ_new is larger than the median, the p-th binary code is set to 1; otherwise, it is set to -1. Another popular thresholding method is to apply the Sign function to µ_new: if the p-th dimension of µ_new is nonnegative, the corresponding code is 1; otherwise, it is -1. Since the prior distribution of the latent semantic vector has zero mean, the Sign function is also a reasonable choice. We use median thresholding as the default method in our experiments and also investigate the Sign function in Section 5.3.
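Both rules are one-liners over the encoder's posterior means; the numpy sketch below is purely illustrative and the array names are ours.

```python
import numpy as np

def binarize_median(train_mu, query_mu):
    """Median thresholding: +1 where a dimension exceeds its training median, else -1."""
    medians = np.median(train_mu, axis=0)          # one threshold per latent dimension
    return np.where(query_mu > medians, 1, -1)

def binarize_sign(query_mu):
    """Sign thresholding: +1 for nonnegative dimensions, else -1."""
    return np.where(query_mu >= 0, 1, -1)

# train_mu and query_mu would be the posterior means produced by the encoder,
# e.g. arrays of shape (n_train, K) and (n_query, K).
```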
3.5 Discussions
The computational complexity of VDSH for a training document is O(BD² + DSV). Here, O(BD²) is the cost of the encoder, where B is the number of layers in the encoder network and D is the average dimension of these layers; O(DSV) is the cost of the decoder, where S is the average length of the documents and V is the vocabulary size. The computational complexity of VDSH-S and VDSH-SP is O(BD² + DS(V + L)), where L is the size of the label set. The computational cost of the proposed models is thus at the same level as that of a deterministic autoencoder. Model learning can be quite efficient since the computations of all the models can be parallelized on GPUs, and only one sample is required during the training process.

The proposed deep generative models have a few desirable properties for text hashing. First of all, they have the capacity of deep neural networks to learn sophisticated semantic representations of text documents. Moreover, being generative models brings advantages over other deep learning models such as Convolutional Neural Networks (CNNs), because the underlying document generative process makes the model assumptions explicit. For example, as shown in [38], it is desirable to have independent feature dimensions in hash codes. To achieve this, our models only need to assume that the latent semantic vector is drawn from a prior distribution with independent dimensions (e.g., a standard Gaussian). The probabilistic approach also provides a principled framework for model extensions, as evident in VDSH-S and VDSH-SP. Furthermore, instead of learning a particular latent semantic vector, our models learn probability distributions of the semantic vector. This can be viewed as finding a region instead of a fixed point in the latent space for document representation, which leads to more robust models. Compared with other deep generative models such as stacked RBMs and GANs, our models are computationally tractable and stable and can be estimated by the efficient backpropagation algorithm.

RCV1 (8 / 16 / 32 / 64 / 128 bits) | Reuters (8 / 16 / 32 / 64 / 128 bits)
LSH [2]            0.4180 0.4352 0.4716 0.5214 0.5877 | 0.2802 0.3215 0.3862 0.4667 0.5194
SpH [38]           0.5093 0.7121 0.7475 0.7559 0.7423 | 0.6080 0.6340 0.6513 0.6290 0.6045
STH [41]           0.3975 0.4898 0.5592 0.5945 0.5946 | 0.6616 0.7351 0.7554 0.7350 0.6986
Stacked RBMs [28]  0.5106 0.5743 0.6130 0.6463 0.6531 | 0.5113 0.5740 0.6154 0.6177 0.6452
KSH [20]           0.9126 0.9146 0.9221 0.9333 0.9350 | 0.7840 0.8376 0.8480 0.8537 0.8620
SHTTM [36]         0.8820 0.9038 0.9258 0.9459 0.9447 | 0.7992 0.8520 0.8323 0.8271 0.8150
VDSH               0.7976 0.7944 0.8481 0.8951 0.8444 | 0.6859 0.7165 0.7753 0.7456 0.7318
VDSH-S             0.9652† 0.9749† 0.9801† 0.9804† 0.9800† | 0.9005† 0.9121† 0.9337† 0.9407† 0.9299†
VDSH-SP            0.9666† 0.9757† 0.9788† 0.9805† 0.9794† | 0.8890† 0.9326† 0.9283† 0.9286† 0.9395†

20Newsgroups (8 / 16 / 32 / 64 / 128 bits) | TMC (8 / 16 / 32 / 64 / 128 bits)
LSH [2]            0.0578 0.0597 0.0666 0.0770 0.0949 | 0.4388 0.4393 0.4514 0.4553 0.4773
SpH [38]           0.2545 0.3200 0.3709 0.3196 0.2716 | 0.5807 0.6055 0.6281 0.6143 0.5891
STH [41]           0.3664 0.5237 0.5860 0.5806 0.5443 | 0.3723 0.3947 0.4105 0.4181 0.4123
Stacked RBMs [28]  0.0594 0.0604 0.0533 0.0623 0.0642 | 0.4846 0.5108 0.5166 0.5190 0.5137
KSH [20]           0.4257 0.5559 0.6103 0.6488 0.6638 | 0.6608 0.6842 0.7047 0.7175 0.7243
SHTTM [36]         0.2690 0.3235 0.2357 0.1411 0.1299 | 0.6299 0.6571 0.6485 0.6893 0.6474
VDSH               0.3643 0.3904 0.4327 0.1731 0.0522 | 0.4330 0.6853 0.7108 0.4410 0.5847
VDSH-S             0.6586† 0.6791† 0.7564† 0.6850† 0.6916† | 0.7387† 0.7887† 0.7883† 0.7967† 0.8018†
VDSH-SP            0.6609† 0.6551† 0.7125† 0.7045† 0.7117† | 0.7498† 0.7798† 0.7891† 0.7888† 0.7970†

Table 1: Precision of the top 100 retrieved documents on four datasets with different numbers of hashing bits. The bold font denotes the best result at that number of bits. † denotes that the improvement over the best result of the baselines is statistically significant based on the paired t-test (p-value < 0.01).
4 EXPERIMENTAL SETUP

4.1 Datasets
[…] 2) Reuters21578², a text corpus for text classification. This collection has 10,788 documents with 90 categories and 7,164 unique words. 3) 20Newsgroups³. This dataset is a collection of 18,828 newsgroup posts, partitioned (nearly) evenly across 20 different newsgroups/categories. It has become a popular dataset for experiments in text applications of machine learning techniques. 4) TMC⁴. This dataset contains the air traffic reports provided by NASA and was used as part of the SIAM text mining competition. It has 22 labels, 21,519 training documents, 3,498 test documents, and 3,498 documents for the validation set. All the datasets are multi-label except 20Newsgroups.

Each dataset was split into three subsets with roughly 80% for training, 10% for validation, and 10% for test. The training data is used to learn the mapping from document to hash code. Each document in the test set is used to retrieve similar documents based on the mapping, and the results are evaluated. The validation set is used to choose the hyperparameters. We removed the stopwords using SMART's list of 571 stopwords⁵. No stemming was performed. We use TFIDF [23] as the default term weighting scheme for the raw document representation (i.e., d). We experiment with other term weighting schemes in Section 5.4.
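For illustration, a TFIDF bag-of-words representation like d can be built with scikit-learn as sketched below. The paper removes SMART stopwords and applies no stemming; here scikit-learn's built-in English stopword list and a toy corpus stand in for that pipeline, so treat the details as assumptions rather than the authors' preprocessing.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "nhl trade rumor montreal ottawa",   # placeholder documents
    "books for sale in ann arbor",
]
vectorizer = TfidfVectorizer(stop_words="english")
d = vectorizer.fit_transform(docs).toarray()   # each row is the TFIDF vector fed to the encoder
vocab_size = d.shape[1]
```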
4.2 Baselines and Settings
We compare the proposed models with the following six competitive baselines, which have been extensively used for text hashing in the prior work [36]: Locality Sensitive Hashing (LSH)⁶ [2], Spectral Hashing (SpH)⁷ [38], Self-Taught Hashing (STH)⁸ [41], Stacked Restricted Boltzmann Machines (Stacked RBMs) [28], Supervised Hashing with Kernels (KSH) [20], and Semantic Hashing using Tags and Topic Modeling (SHTTM) [36]. We used the validation dataset to choose the hyperparameters for the baselines.

For our proposed models, we adopt the method in [6] for weight initialization. The Adam optimizer [12] with a step size of 0.001 is used due to its fast convergence. Following the practice in [37], we use the dropout technique [30] with a keep probability of 0.8 during training to alleviate overfitting. The number of hidden nodes of the models is 1,500 for RCV1 and 1,000 for the other three smaller datasets. All the experiments were conducted on a server with 2 Intel E5-2630 CPUs and 4 GeForce GTX TITAN X GPUs. The proposed deep models were implemented on the TensorFlow⁹ platform. For the VDSH model on the Reuters21578, 20Newsgroups, and TMC datasets, each epoch takes about 60 seconds and each run takes 30 epochs to converge. For RCV1, it takes about 3,600 seconds per epoch and needs fewer epochs (about 15) to reach satisfactory performance. Since RCV1 is much larger than the other three datasets, this shows that the proposed models are quite scalable. VDSH-S and VDSH-SP take slightly more time to train than VDSH does (about 40 minutes each on Reuters21578, 20Newsgroups, and TMC, and 20 hours on RCV1).

² http://www.nltk.org/book/ch02.html
³ http://ana.cachopo.org/datasets-for-single-label-text-categorization
⁴ https://catalog.data.gov/dataset/siam-2007-text-mining-competition-dataset
⁵ http://www.lextek.com/manuals/onix/stopwords2.html
⁶ http://pixelogik.github.io/NearPy/
⁷ http://www.cs.huji.ac.il/~yweiss/SpectralHashing/
⁸ http://www.dcs.bbk.ac.uk/~dell/publications/dellzhang_sigir2010/sth_v1.zip
⁹ https://www.tensorflow.org/
Figure 2: The precision within the Hamming distance of 2 on four datasets (RCV1, Reuters, 20Newsgroups, TMC) with different numbers of hashing bits.
4.3 Evaluation Metrics
To evaluate the effectiveness of hashing codes in similarity search, each document in the test set is used as a query document to search for similar documents based on the Hamming distance (i.e., the number of differing bits) between their hashing codes. Following the prior work in text hashing [36], the performance is measured by Precision, the ratio of the number of retrieved relevant documents to the number of all retrieved documents. The results are averaged over all the test documents.

There exist various ways to determine whether a retrieved document is relevant to the given query document. In SpH [38], the K closest documents in the original feature space are considered the relevant documents. This metric is not desirable since similarity in the original feature space may not well reflect document semantic similarity. Also, it is hard to determine a suitable K for the cutoff threshold. Instead, we adopt the methodology used in KSH [32], SHTTM [36], and other prior work [32]: a retrieved document that shares any common test label with the query document is regarded as a relevant document.
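Put as code, the metric ranks the corpus by Hamming distance and checks label overlap in the top 100; the sketch below is our own illustration with hypothetical array names.

```python
import numpy as np

def precision_at_k(query_code, query_labels, doc_codes, doc_labels, k=100):
    """Precision@k for one query under the shared-label relevance criterion.

    query_code: (n_bytes,) packed uint8 hash code of the query document
    doc_codes:  (n_docs, n_bytes) packed uint8 codes of the corpus
    *_labels:   {0,1} label-indicator arrays (multi-label)
    """
    dist = np.unpackbits(np.bitwise_xor(doc_codes, query_code), axis=1).sum(axis=1)
    top_k = np.argsort(dist, kind="stable")[:k]
    relevant = (doc_labels[top_k] @ query_labels) > 0    # shares at least one label
    return relevant.mean()
```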
5 EXPERIMENTAL RESULTS

5.1 Baseline Comparison
Table 1 shows the results of the different methods over various numbers of bits on the four testbeds. We have several observations from the results. First of all, the best results at different bits are all achieved by VDSH-S or VDSH-SP. They consistently yield better results than all the baselines across all the different numbers of bits, and all the improvements over the baselines are statistically significant based on the paired t-test (p-value < 0.01). VDSH-S and VDSH-SP produce comparable results. Adding private variables does not always help, because it increases the model flexibility, which may lead to overfitting to the training data. This probably explains why VDSH-SP generally yields better performance when the number of bits is 8, which corresponds to a simpler model.

Secondly, the supervised hashing techniques (i.e., VDSH-S, VDSH-SP, KSH) outperform the unsupervised methods on the four datasets across all the different bits. These results demonstrate the importance of utilizing supervisory signals for text hashing. However, the unsupervised model STH outperforms SHTTM on the original 20-category Newsgroups. One possible explanation is that SHTTM depends on LDA to learn an initial representation, but many categories in Newsgroups are correlated, so LDA could assign similar topics to documents from related categories (e.g., Christian and Religion). Hence SHTTM may not effectively distinguish two related categories. Evidently, SHTTM and KSH deliver comparable results except on the 20Newsgroups testbed. It is worth noting that there exist substantial gaps between the supervised and unsupervised proposed models (VDSH-S and VDSH-SP vs. VDSH) across all the datasets and configurations. The label information seems remarkably useful for guiding the deep generative models to learn effective representations. This is probably due to the high capacity of the neural network component, which can learn subtle patterns from supervisory signals when available.

Thirdly, the performance does not always improve when the number of bits increases. This pattern is quite consistent across all the compared methods and is likely the result of model overfitting, which suggests that using a long hash code is not always helpful, especially when training data is limited. Last but not least, the testbeds may affect the model performance. All the best results are obtained on the RCV1 dataset, whose size is much larger than the other testbeds. These results illustrate the importance of using a large amount of data to train text hashing models.

It is worth noting that some of the baseline results are different from what was reported in the prior work. This is due to the data preprocessing. For example, [36] combined some categories in 20Newsgroups to form 6 broader categories in their experiments, while we use all the original 20 categories for evaluation. [41] focused on single-label documents by discarding the documents appearing in more than one category, while we use all the documents in the corpus.

5.2 Retrieval with Fixed Hamming Distance
In practice, IR systems may retrieve similar documents in a large corpus within a fixed Hamming distance radius of the query document. In this section, we evaluate the precision for the Hamming radius of 2. Figure 2 shows the results on the four datasets with different numbers of hashing bits. We can see that the overall best performance among all nine hashing methods on each dataset is achieved by either VDSH-S or VDSH-SP at 16 bits. In general, the precision of most of the methods decreases when the number of hashing bits increases from 32 to 128. This may be due to the
fact that when using longer hashing bits, the Hamming space becomes increasingly sparse and very few documents fall within the Hamming distance of 2, resulting in more queries with precision 0. Similar behavior is also observed in prior work such as KSH [20] and SHTTM [36]. A notable exception is Stacked RBMs, whose performance is quite stable across different numbers of bits while lagging behind the best performers.

5.3 Effect of Thresholding
Thresholding is an important step in hashing to transform a continuous document representation into a binary code. We investigate two popular thresholding functions, Median and Sign, which are introduced in Section 3.4. Table 2 contains the precision results of the proposed models with the 32-bit hash code on the four datasets. As we can see, the two thresholding functions generate quite similar results and their differences are not statistically significant, which indicates that all the proposed models, whether unsupervised or supervised, are not sensitive to the thresholding method.

[Table 2: Precision of VDSH, VDSH-S, and VDSH-SP with the 32-bit hash code under Median and Sign thresholding; the numeric entries did not survive extraction.]

5.4 Effect of Term Weighting Schemes
In this section, we investigate the effect of term weighting schemes on the performance of the proposed models. Different term weights result in different bag-of-words representations of d as the input to the neural network. Specifically, we experiment with three term weighting representations for documents: Binary, Term Frequency (TF), and Term Frequency-Inverse Document Frequency (TFIDF) [23]. Figure 3 illustrates the results of the proposed models with the 32-bit hash code on the four datasets. As we can see, the proposed models generally are not very sensitive to the underlying term weighting schemes. The TFIDF weighting always gives the best performance on all four datasets. The improvement is more noticeable with VDSH-S and VDSH-SP on 20Newsgroups. The results indicate that more sophisticated weighting schemes may capture more information about the original documents and thus lead to better hashing results. On the other hand, all three models yield quite stable results on RCV1, which suggests that a large-scale dataset may help alleviate the shortcomings of the basic term weighting schemes.

Figure 3: Precision@100 for different term weighting schemes on the proposed models with the 32-bit hash code.

5.5 Qualitative Analysis
In this section, we visualize the low-dimensional representations of the documents and examine whether they preserve the semantics of the original documents. Specifically, we use t-SNE¹⁰ [22] to generate scatter plots for the document latent semantic vectors in 32-dimensional space obtained by SHTTM and VDSH-S on the 20Newsgroups dataset. Figure 4 shows the results. Here, each data point represents a document which is associated with one of the 20 categories, and different colors represent different categories based on the ground truth.

As we can see in Figure 4(b), VDSH-S generates well-separated clusters, each corresponding to a true category (each number in the plot represents a category ID). On the other hand, the clustering structure from SHTTM shown in Figure 4(a) is much less evident and recognizable. Some nearby clusters in Figure 4(b) are also semantically related, e.g., Category 7 (Religion) and Category 11 (Atheism); Category 20 (Middle East) and Category 10 (Guns); Category 8 (WinX) and Category 5 (Graphics).

We further sampled some documents from the dataset to see where they are placed in the plots. Table 3 contains the DocIDs, categories, and subjects of the sample documents. Doc5780 discusses a trade rumor in the NHL and Doc5773 is about NHL team leaders. Both documents belong to the category Hockey and should be close to each other, which can be clearly observed in Figure 4(b) for VDSH-S. However, these two documents are projected far away from each other by SHTTM, as shown in Figure 4(a). For another random pair of documents, Doc3420 and Doc3412, VDSH-S also puts them much closer to each other than SHTTM does. These results demonstrate the great effectiveness of VDSH-S in learning low-dimensional representations for text documents.

¹⁰ https://lvdmaaten.github.io/tsne/
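The visualization above can be reproduced with scikit-learn's t-SNE; the sketch below uses random stand-ins for the 32-dimensional latent vectors and category IDs, so it only illustrates the plotting recipe, not the paper's actual data.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

latent = np.random.randn(500, 32)            # stand-in for the encoder's posterior means
labels = np.random.randint(0, 20, size=500)  # stand-in for the 20 category IDs

coords = TSNE(n_components=2, random_state=0).fit_transform(latent)
plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=5, cmap="tab20")
plt.show()
```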
Figure 4: Visualization of the 32-dimensional document latent semantic vectors by SHTTM and VDSH-S on the 20Newsgroups dataset using t-SNE. Each point represents a document and different colors denote different categories based on the ground truth. In (b) VDSH-S, each number is a category ID; the corresponding categories are: 1: Biker, 2: MAC, 3: Politics, 4: Christian, 5: Graphics, 6: Medicines, 7: Religion, 8: WinX, 9: IBM, 10: Guns, 11: Atheism, 12: MS, 13: Crypt, 14: Space, 15: ForSale, 16: Hockey, 17: Baseball, 18: Electronics, 19: Autos, 20: MidEast.
DocId    Category  Title/Subject
Doc5780  Hockey    Trade rumor: Montreal/Ottawa/Phillie
Doc5773  Hockey    NHL team leaders in +/-
Doc3420  ForSale   Books For Sale [Ann Arbor, MI]
Doc3412  ForSale   *** NeXTstation 8/105 For Sale ***
Table 3: The titles of the four sample documents in Figure 4.

[…] for encoder and decoder. These more sophisticated models may be able to capture the local relations (by CNN) or sequential information (by RNN, NADE, MADE) in text. Moreover, we will utilize the probabilistic generative process to sample and simulate new text, which may facilitate the task of Natural Language Generation [26]. Last but not least, we will adapt the proposed models to hash other types of data such as images and videos.