
NASH: Toward End-to-End Neural Architecture for Generative Semantic Hashing

Dinghan Shen1∗, Qinliang Su2∗, Paidamoyo Chapfuwa1, Wenlin Wang1, Guoyin Wang1, Lawrence Carin1, Ricardo Henao1
1 Duke University   2 Sun Yat-sen University
[email protected]
∗ Equal contribution.

arXiv:1805.05361v1 [cs.CL] 14 May 2018

Abstract

Semantic hashing has become a powerful paradigm for fast similarity search in many information retrieval systems. While fairly successful, previous techniques generally require two-stage training, and the binary constraints are handled ad-hoc. In this paper, we present an end-to-end Neural Architecture for Semantic Hashing (NASH), where the binary hashing codes are treated as Bernoulli latent variables. A neural variational inference framework is proposed for training, where gradients are directly backpropagated through the discrete latent variable to optimize the hash function. We also draw connections between the proposed method and rate-distortion theory, which provides a theoretical foundation for the effectiveness of the proposed framework. Experimental results on three public datasets demonstrate that our method significantly outperforms several state-of-the-art models on both unsupervised and supervised scenarios.

1 Introduction

The problem of similarity search, also called nearest-neighbor search, consists of finding documents from a large collection of documents, or corpus, which are most similar to a query document of interest. Fast and accurate similarity search is at the core of many information retrieval applications, such as plagiarism analysis (Stein et al., 2007), collaborative filtering (Koren, 2008), content-based multimedia retrieval (Lew et al., 2006) and caching (Pandey et al., 2009). Semantic hashing is an effective approach for fast similarity search (Salakhutdinov and Hinton, 2009; Zhang et al., 2010; Wang et al., 2014). By representing every document in the corpus as a similarity-preserving discrete (binary) hashing code, the similarity between two documents can be evaluated by simply calculating pairwise Hamming distances between hashing codes, i.e., the number of bits that are different between two codes. Given that today an ordinary PC is able to execute millions of Hamming distance computations in just a few milliseconds (Zhang et al., 2010), this semantic hashing strategy is very computationally attractive.
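As a concrete illustration of the retrieval primitive used throughout the paper (this sketch is ours, not part of the original text), the Hamming distance between two l-bit codes reduces to a single XOR followed by a bit count:

# Illustrative sketch (not from the paper): Hamming distance between two
# binary hashing codes packed into Python integers.
def hamming_distance(code_a: int, code_b: int) -> int:
    """Count the bits that differ between two l-bit codes."""
    return bin(code_a ^ code_b).count("1")

# Example with two 8-bit codes that differ in exactly one position.
assert hamming_distance(0b11101001, 0b11111001) == 1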
While considerable research has been devoted to text (semantic) hashing, existing approaches typically require two-stage training procedures. These methods can be generally divided into two categories: (i) binary codes for documents are first learned in an unsupervised manner, then l binary classifiers are trained via supervised learning to predict the l-bit hashing code (Zhang et al., 2010; Xu et al., 2015); (ii) continuous text representations are first inferred, which are binarized as a second (separate) step during testing (Wang et al., 2013; Chaidaroon and Fang, 2017). Because the model parameters are not learned in an end-to-end manner, these two-stage training strategies may result in suboptimal local optima. This happens because different modules within the model are optimized separately, preventing the sharing of information between them. Further, in existing methods, binary constraints are typically handled ad-hoc by truncation, i.e., the hashing codes are obtained via direct binarization from continuous representations after training. As a result, the information contained in the continuous representations is lost during the (separate) binarization process. Moreover, training different modules (mapping and classifier/binarization) separately often requires additional hyperparameter tuning for each training stage, which can be laborious and time-consuming.

In this paper, we propose a simple and generic neural architecture for text hashing that learns binary latent codes for documents in an end-to-end manner. Inspired by recent advances in neural variational inference (NVI) for text processing (Miao et al., 2016; Yang et al., 2017; Shen et al., 2017b), we approach semantic hashing from a generative model perspective, where binary (hashing) codes are represented as either deterministic or stochastic Bernoulli latent variables. The inference (encoder) and generative (decoder) networks are optimized jointly by maximizing a variational lower bound to the marginal distribution of the input documents (corpus). By leveraging a simple and effective method to estimate the gradients with respect to discrete (binary) variables, the loss term from the generative (decoder) network can be directly backpropagated into the inference (encoder) network to optimize the hash function.

[Figure 1: NASH for end-to-end semantic hashing. The inference network maps x → z using an MLP and the generative network recovers x as z → x̂.]

Motivated by rate-distortion theory (Berger, 1971; Theis et al., 2017), we propose to inject data-dependent noise into the latent codes during the decoding stage, which adaptively accounts for the tradeoff between minimizing rate (number of bits used, or effective code length) and distortion (reconstruction error) during training. The connection between the proposed method and rate-distortion theory is further elucidated, providing a theoretical foundation for the effectiveness of our framework.

Summarizing, the contributions of this paper are: (i) to the best of our knowledge, we present the first semantic hashing architecture that can be trained in an end-to-end manner; (ii) we propose a neural variational inference framework to learn compact (regularized) binary codes for documents, achieving promising results on both unsupervised and supervised text hashing; (iii) the connection between our method and rate-distortion theory is established, from which we demonstrate the advantage of injecting data-dependent noise into the latent variable during training.

2 Related Work

Models with discrete random variables have attracted much attention in the deep learning community (Jang et al., 2016; Maddison et al., 2016; van den Oord et al., 2017; Li et al., 2017; Shu and Nakayama, 2017). Some of these structures are more natural choices for language or speech data, which are inherently discrete. More specifically, van den Oord et al. (2017) combined VAEs with vector quantization to learn discrete latent representations, and demonstrated the utility of these learned representations on images, videos, and speech data. Li et al. (2017) leveraged both pairwise label and classification information to learn discrete hash codes, which exhibit state-of-the-art performance on image retrieval tasks.

For natural language processing (NLP), although significant research has been devoted to learning continuous deep representations for words or documents (Mikolov et al., 2013; Kiros et al., 2015; Shen et al., 2018), discrete neural representations have been mainly explored in learning word embeddings (Shu and Nakayama, 2017; Chen et al., 2017). In these recent works, words are represented as vectors of discrete numbers, which are very efficient storage-wise, while showing comparable performance on several NLP tasks relative to continuous word embeddings. However, discrete representations that are learned in an end-to-end manner at the sentence or document level have rarely been explored, and there is a lack of rigorous evaluation regarding their effectiveness. Our work focuses on learning discrete (binary) representations for text documents. Further, we employ semantic hashing (fast similarity search) as a mechanism to evaluate the quality of the learned binary latent codes.

3 The Proposed Method

3.1 Hashing under the NVI Framework

Inspired by the recent success of variational autoencoders for various NLP problems (Miao et al., 2016; Bowman et al., 2015; Yang et al., 2017; Miao et al., 2017; Shen et al., 2017b; Wang et al., 2018), we approach the training of discrete (binary) latent variables from a generative perspective.
Let x and z denote the input document and its corresponding binary hash code, respectively. Most of the previous text hashing methods focus on modeling the encoding distribution p(z|x), or hash function, so that the local/global pairwise similarity structure of documents in the original space is preserved in latent space (Zhang et al., 2010; Wang et al., 2013; Xu et al., 2015; Wang et al., 2014). However, the generative (decoding) process of reconstructing x from the binary latent code z, i.e., modeling the distribution p(x|z), has been rarely considered. Intuitively, latent codes learned from a model that accounts for the generative term should naturally encapsulate key semantic information from x, because the generation/reconstruction objective is a function of p(x|z). In this regard, the generative term provides a natural training objective for semantic hashing.

We define a generative model that simultaneously accounts for both the encoding distribution, p(z|x), and the decoding distribution, p(x|z), by defining approximations qφ(z|x) and qθ(x|z), via inference and generative networks, gφ(x) and gθ(z), parameterized by φ and θ, respectively. Specifically, x ∈ Z+^{|V|} is the bag-of-words (count) representation for the input document, where |V| is the vocabulary size. Notably, we can also employ other count weighting schemes as input features x, e.g., the term frequency-inverse document frequency (TFIDF) (Manning et al., 2008). For the encoding distribution, a latent variable z is first inferred from the input text x, by constructing an inference network gφ(x) to approximate the true posterior distribution p(z|x) as qφ(z|x). Subsequently, the decoder network gθ(z) maps z back into input space to reconstruct the original sequence x as x̂, approximating p(x|z) as qθ(x|z) (as shown in Figure 1). This cyclic strategy, x → z → x̂ ≈ x, provides the latent variable z with a better ability to generalize (Miao et al., 2016).

To tailor the NVI framework for semantic hashing, we cast z as a binary latent variable and assume a multivariate Bernoulli prior on z: p(z) ∼ Bernoulli(γ) = Π_{i=1}^{l} γi^{zi} (1 − γi)^{1−zi}, where γi ∈ [0, 1] is component i of vector γ. Thus, the encoding (approximate posterior) distribution qφ(z|x) is restricted to take the form qφ(z|x) = Bernoulli(h), where h = σ(gφ(x)), σ(·) is the sigmoid function, and gφ(·) is the (nonlinear) inference network specified as a multilayer perceptron (MLP). As illustrated in Figure 1, we can obtain samples from the Bernoulli posterior either deterministically or stochastically. Suppose z is an l-bit hash code; for the deterministic binarization, we have, for i = 1, 2, ..., l:

    zi = 1[σ(gφi(x)) > 0.5] = (sign(σ(gφi(x)) − 0.5) + 1) / 2,   (1)

where z is the binarized variable, and zi and gφi(x) denote the i-th dimension of z and gφ(x), respectively. The standard Bernoulli sampling in (1) can be understood as setting a hard threshold at 0.5 for each representation dimension; therefore, the binary latent code is generated deterministically. Another strategy to obtain the discrete variable is to binarize h in a stochastic manner:

    zi = 1[σ(gφi(x)) > µi] = (sign(σ(gφi(x)) − µi) + 1) / 2,   (2)

where µi ∼ Uniform(0, 1). Because of this sampling process, we do not have to assume a pre-defined threshold value as in (1).

3.2 Training with Binary Latent Variables

To estimate the parameters of the encoder and decoder networks, we would ideally maximize the marginal distribution p(x) = ∫ p(z) p(x|z) dz. However, computing this marginal is intractable in most cases of interest. Instead, we maximize a variational lower bound. This approach is typically employed in the VAE framework (Kingma and Welling, 2013):

    Lvae = E_{qφ(z|x)}[ log ( qθ(x|z) p(z) / qφ(z|x) ) ]
         = E_{qφ(z|x)}[log qθ(x|z)] − DKL(qφ(z|x) || p(z)),   (3)

where the Kullback-Leibler (KL) divergence DKL(qφ(z|x) || p(z)) encourages the approximate posterior distribution qφ(z|x) to be close to the multivariate Bernoulli prior p(z). In this case, DKL(qφ(z|x) || p(z)) can be written in closed form as a function of gφ(x):

    DKL = gφ(x) log( gφ(x) / γ ) + (1 − gφ(x)) log( (1 − gφ(x)) / (1 − γ) ).   (4)

Note that the gradient for the KL divergence term above can be evaluated easily.
For the first term in (3), we should in principle estimate the influence of µi in (2) on qθ(x|z) by averaging over the entire (uniform) noise distribution. However, a closed-form distribution does not exist, since it is not possible to enumerate all possible configurations of z, especially when the latent dimension is large. Moreover, discrete latent variables are inherently incompatible with backpropagation, since the derivative of the sign function is zero for almost all input values. As a result, the exact gradients of Lvae w.r.t. the inputs before binarization would be essentially all zero.

To estimate the gradients for binary latent variables, we utilize the straight-through (ST) estimator, which was first introduced by Hinton (2012). So motivated, the strategy here is to simply backpropagate through the hard threshold by approximating the gradient ∂z/∂φ as 1. Thus, we have:

    ∂E_{qφ(z|x)}[log qθ(x|z)] / ∂φ
      = (dE_{qφ(z|x)}[log qθ(x|z)] / dz) · (dz / dσ(gφi(x))) · (dσ(gφi(x)) / dφ)
      ≈ (dE_{qφ(z|x)}[log qθ(x|z)] / dz) · (dσ(gφi(x)) / dφ).   (5)

Although this is clearly a biased estimator, it has been shown to be a fast and efficient method relative to other gradient estimators for discrete variables, especially for the Bernoulli case (Bengio et al., 2013; Hubara et al., 2016; Theis et al., 2017). With the ST gradient estimator, the first loss term in (3) can be backpropagated into the encoder network to fine-tune the hash function gφ(x).
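To make Sections 3.1 and 3.2 concrete, the following minimal sketch implements the Bernoulli encoder, the deterministic/stochastic binarizations of Eqs. (1)-(2), the closed-form KL term of Eq. (4), and the straight-through trick behind Eq. (5). It is written against PyTorch, which is an assumption on our part (the paper does not specify a framework), and all names (BernoulliEncoder, bernoulli_kl, layer sizes) are illustrative rather than the authors' code.

import torch
import torch.nn as nn

class BernoulliEncoder(nn.Module):
    """Illustrative inference network g_phi(x): count/TFIDF vector -> l-bit code."""
    def __init__(self, vocab_size, latent_bits, hidden=500):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vocab_size, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, latent_bits),
        )

    def forward(self, x, stochastic=True):
        h = torch.sigmoid(self.mlp(x))           # h = sigma(g_phi(x)), in (0, 1)
        if stochastic:
            threshold = torch.rand_like(h)       # mu_i ~ Uniform(0, 1), Eq. (2)
        else:
            threshold = torch.full_like(h, 0.5)  # hard threshold, Eq. (1)
        z_hard = (h > threshold).float()
        # Straight-through estimator (Eq. (5)): the forward pass uses the binary
        # code, while the backward pass treats dz/dh as 1, so gradients of the
        # reconstruction loss reach g_phi(x).
        z = z_hard + h - h.detach()
        return z, h

def bernoulli_kl(h, gamma=0.5, eps=1e-6):
    """Closed-form KL(q_phi(z|x) || Bernoulli(gamma)) of Eq. (4), summed over bits."""
    h = h.clamp(eps, 1.0 - eps)
    kl = h * torch.log(h / gamma) + (1.0 - h) * torch.log((1.0 - h) / (1.0 - gamma))
    return kl.sum(dim=-1)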
For the approximate generator qθ(x|z) in (3), let xi denote the one-hot representation of the i-th word within a document. Note that x = Σ_i xi is thus the bag-of-words representation for document x. To reconstruct the input x from z, we utilize a softmax decoding function written as:

    q(xi = w|z) = exp(zᵀ E xw + bw) / Σ_{j=1}^{|V|} exp(zᵀ E xj + bj),   (6)

where q(xi = w|z) is the probability that xi is word w ∈ V, qθ(x|z) = Π_i q(xi = w|z), and θ = {E, b1, ..., b_{|V|}}. Note that E ∈ R^{d×|V|} can be interpreted as a word embedding matrix to be learned, and {bi}_{i=1}^{|V|} denote bias terms. Intuitively, the objective in (6) encourages the discrete vector z to be close to the embeddings of every word that appears in the input document x. As shown in Section 5.3.1, meaningful semantic structures can be learned and manifested in the word embedding matrix E.

3.3 Injecting Data-dependent Noise to z

To reconstruct text data x from the sampled binary representation z, a deterministic decoder is typically utilized (Miao et al., 2016; Chaidaroon and Fang, 2017). Inspired by the success of employing stochastic decoders in image hashing applications (Dai et al., 2017; Theis et al., 2017), in our experiments we found that injecting random Gaussian noise into z makes the decoder a more favorable regularizer for the binary codes, which in practice leads to stronger retrieval performance. Below, we invoke rate-distortion theory to perform some further analysis, which leads to interesting findings.

Learning binary latent codes z to represent a continuous distribution p(x) is a classical information theory concept known as lossy source coding. From this perspective, semantic hashing, which compresses an input document into compact binary codes, can be cast as a conventional rate-distortion tradeoff problem (Theis et al., 2017; Ballé et al., 2016):

    min  −log2 R(z) + β · D(x, x̂),   (7)

where rate and distortion denote the effective code length, i.e., the number of bits used, and the distortion introduced by the encoding/decoding sequence, respectively. Further, x̂ is the reconstructed input and β is a hyperparameter that controls the tradeoff between the two terms.

Consider the case where we have a Bernoulli prior on z, p(z) ∼ Bernoulli(γ), and x is conditionally drawn from a Gaussian distribution p(x|z) ∼ N(Ez, σ²I). Here, E = {ei}_{i=1}^{|V|}, where ei ∈ R^d can be interpreted as a codebook with |V| codewords. In our case, E corresponds to the word embedding matrix as in (6).

For the case of a stochastic latent variable z, the objective function in (3) can be written in a form similar to the rate-distortion tradeoff:

    min E_{qφ(z|x)}[ −log qφ(z|x) + (1/(2σ²)) ||x − Ez||² + C ],   (8)
where −log qφ(z|x) corresponds to the rate, ||x − Ez||² to the distortion, and C is a constant that encapsulates the prior distribution p(z) and the Gaussian distribution normalization term. Notably, the trade-off hyperparameter β = σ⁻²/2 is closely related to the variance of the distribution p(x|z). In other words, by controlling the variance σ, the model can adaptively explore different trade-offs between the rate and distortion objectives. However, the optimal trade-offs for distinct samples may be different.

Inspired by the observations above, we propose to inject data-dependent noise into the latent variable z, rather than setting the variance term σ² to a fixed value (Dai et al., 2017; Theis et al., 2017). Specifically, log σ² is obtained via a one-layer MLP transformation from gφ(x). Afterwards, we sample z′ from N(z, σ²I), which then replaces z in (6) to infer the probability of generating individual words (as shown in Figure 1). As a result, the variances are different for every input document x, and thus the model is provided with additional flexibility to explore various trade-offs between rate and distortion for different training observations. Although our decoder is not strictly a Gaussian distribution, as in (6), we found empirically that injecting data-dependent noise into z yields strong retrieval results; see Section 5.1.
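The sketch below complements the encoder sketch above with the softmax reconstruction of Eq. (6) and the data-dependent noise z′ ∼ N(z, σ²I) of this section. Again this is only an illustration under assumed PyTorch conventions; the argument h stands for the encoder's l-dimensional representation (the paper predicts log σ² from gφ(x) with a one-layer transformation), and all module and variable names are ours.

import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisySoftmaxDecoder(nn.Module):
    """Illustrative decoder: scores every vocabulary word against a noisy copy of z."""
    def __init__(self, vocab_size, latent_bits):
        super().__init__()
        self.E = nn.Parameter(0.01 * torch.randn(latent_bits, vocab_size))  # word embeddings
        self.b = nn.Parameter(torch.zeros(vocab_size))                      # bias terms b_w
        self.log_var = nn.Linear(latent_bits, latent_bits)  # one-layer map to log(sigma^2)

    def forward(self, z, h, x_counts):
        # Data-dependent Gaussian noise: z' ~ N(z, sigma^2 I), with log sigma^2
        # predicted from the encoder representation.
        sigma = torch.exp(0.5 * self.log_var(h))
        z_noisy = z + sigma * torch.randn_like(z)
        logits = z_noisy @ self.E + self.b        # z'^T E x_w + b_w for every word w, Eq. (6)
        log_probs = F.log_softmax(logits, dim=-1)
        # Reconstruction term of Eq. (3): log-likelihood of the observed word counts.
        return (x_counts * log_probs).sum(dim=-1)

Replacing sigma above with a fixed constant corresponds to the NASH-N variant evaluated in Section 5, while dropping the noise entirely recovers the deterministic-decoder NASH variant.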
3.4 Supervised Hashing

The proposed Neural Architecture for Semantic Hashing (NASH) can be extended to supervised hashing, where a mapping from the latent variable z to labels y is learned, here parametrized by a two-layer MLP followed by a fully-connected softmax layer. To allow the model to explore and balance between maximizing the variational lower bound in (3) and minimizing the discriminative loss, the following joint training objective is employed:

    L = −Lvae(θ, φ; x) + α Ldis(η; z, y),   (9)

where η refers to the parameters of the MLP classifier and α controls the relative weight between the variational lower bound (Lvae) and the discriminative loss (Ldis), defined as the cross-entropy loss. The parameters {θ, φ, η} are learned end-to-end via Monte Carlo estimation.
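For completeness, a minimal sketch of the joint objective in Eq. (9), again under assumed PyTorch conventions; the classifier layout follows the two-layer MLP plus softmax layer described above, and the function and argument names are illustrative.

import torch.nn as nn
import torch.nn.functional as F

class LabelClassifier(nn.Module):
    """Illustrative two-layer MLP with a final softmax (cross-entropy) layer over labels y."""
    def __init__(self, latent_bits, num_classes, hidden=500):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_bits, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, z):
        return self.net(z)   # unnormalized scores; softmax is folded into the loss

def joint_loss(recon_log_prob, kl, logits, labels, alpha=1.0):
    """L = -L_vae + alpha * L_dis, Eq. (9); L_dis is the cross-entropy loss."""
    neg_elbo = -(recon_log_prob - kl).mean()
    return neg_elbo + alpha * F.cross_entropy(logits, labels)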
4 Experimental Setup

4.1 Datasets

We use the following three standard, publicly available datasets for training and evaluation: (i) Reuters21578, containing 10,788 news documents, which have been classified into 90 different categories; (ii) 20Newsgroups, a collection of 18,828 newsgroup documents, which are categorized into 20 different topics; (iii) TMC (which stands for SIAM text mining competition), containing air traffic reports provided by NASA. TMC consists of 21,519 training documents divided into 22 different categories. To make direct comparisons with prior works, we employed the TFIDF features on these datasets supplied by Chaidaroon and Fang (2017), where the vocabulary sizes for the three datasets are set to 10,000, 7,164 and 20,000, respectively.

4.2 Training Details

For the inference networks, we employ a feed-forward neural network with 2 hidden layers (both with 500 units) using the ReLU non-linearity as activation function, which transforms the input documents, i.e., TFIDF features in our experiments, into a continuous representation. Empirically, we found that stochastic binarization as in (2) shows stronger performance than deterministic binarization, and thus use the former in our experiments. However, we further conduct a systematic ablation study in Section 5.2 to compare the two binarization strategies.

Our model is trained using Adam (Kingma and Ba, 2014), with a learning rate of 1 × 10⁻³ for all parameters. We decay the learning rate by a factor of 0.96 for every 10,000 iterations. Dropout (Srivastava et al., 2014) is employed on the output of the encoder networks, with the rate selected from {0.7, 0.8, 0.9} on the validation set. To facilitate comparisons with previous methods, we set the dimension of z, i.e., the number of bits within the hashing code, to 8, 16, 32, 64, or 128.
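The optimization setup above can be summarized in a few lines. The scheduler choice below is an assumption on our part (any mechanism that multiplies the learning rate by 0.96 every 10,000 iterations would do), and PyTorch is assumed.

import torch

def build_optimizer(model):
    """Illustrative optimizer/schedule matching Section 4.2."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    # Decay the learning rate by a factor of 0.96 every 10,000 iterations,
    # calling scheduler.step() once per training iteration.
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10_000, gamma=0.96)
    return optimizer, scheduler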
4.3 Baselines

We evaluate the effectiveness of our framework on both unsupervised and supervised semantic hashing tasks. We consider the following unsupervised baselines for comparisons: Locality Sensitive Hashing (LSH) (Datar et al., 2004), Stack Restricted Boltzmann Machines (S-RBM) (Salakhutdinov and Hinton, 2009), Spectral Hashing (SpH) (Weiss et al., 2009), Self-taught Hashing (STH) (Zhang et al., 2010) and Variational Deep Semantic Hashing (VDSH) (Chaidaroon and Fang, 2017).
For supervised semantic hashing, we also compare NASH against a number of baselines: Supervised Hashing with Kernels (KSH) (Liu et al., 2012), Semantic Hashing using Tags and Topic Modeling (SHTTM) (Wang et al., 2013) and Supervised VDSH (Chaidaroon and Fang, 2017). It is worth noting that, unlike all these baselines, our NASH model is trained end-to-end in one step.

4.4 Evaluation Metrics

To evaluate the hashing codes for similarity search, we consider each document in the testing set as a query document. Similar documents to the query in the corresponding training set need to be retrieved based on the Hamming distance of their hashing codes, i.e., the number of different bits. To facilitate comparison with prior work (Wang et al., 2013; Chaidaroon and Fang, 2017), the performance is measured with precision. Specifically, during testing, for a query document, we first retrieve the 100 nearest/closest documents according to the Hamming distances of the corresponding hash codes (i.e., the number of different bits). We then examine the percentage of documents among these 100 retrieved ones that belong to the same label (topic) as the query document (we consider documents having the same label as relevant pairs). The ratio of the number of relevant documents to the number of retrieved documents (fixed value of 100) is calculated as the precision score. The precision scores are further averaged over all test (query) documents.
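The evaluation protocol above can be expressed compactly as follows. This is our own NumPy sketch of the described procedure, with array names chosen for illustration (codes are 0/1 matrices of shape (num_documents, num_bits)).

import numpy as np

def precision_at_100(train_codes, train_labels, query_codes, query_labels, k=100):
    """Average top-k retrieval precision under Hamming distance, as described above."""
    precisions = []
    for code, label in zip(query_codes, query_labels):
        # Hamming distance: number of differing bits between the query code and
        # every training code.
        dists = np.count_nonzero(train_codes != code, axis=1)
        nearest = np.argsort(dists, kind="stable")[:k]   # the 100 closest training documents
        precisions.append(np.mean(train_labels[nearest] == label))
    return float(np.mean(precisions))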
5 Experimental Results

We experimented with four variants of our NASH model: (i) NASH: with deterministic decoder; (ii) NASH-N: with fixed random noise injected into the decoder; (iii) NASH-DN: with data-dependent noise injected into the decoder; (iv) NASH-DN-S: NASH-DN with supervised information during training.

Method     8 bits    16 bits   32 bits   64 bits   128 bits
LSH        0.2802    0.3215    0.3862    0.4667    0.5194
S-RBM      0.5113    0.5740    0.6154    0.6177    0.6452
SpH        0.6080    0.6340    0.6513    0.6290    0.6045
STH        0.6616    0.7351    0.7554    0.7350    0.6986
VDSH       0.6859    0.7165    0.7753    0.7456    0.7318
NASH       0.7113    0.7624    0.7993    0.7812    0.7559
NASH-N     0.7352    0.7904    0.8297    0.8086    0.7867
NASH-DN    0.7470    0.8013    0.8418    0.8297    0.7924

Table 1: Precision of the top 100 retrieved documents on the Reuters dataset (Unsupervised hashing).

[Figure 2: Precision of the top 100 retrieved documents on the Reuters dataset (Supervised hashing), compared with other supervised baselines.]

5.1 Semantic Hashing Evaluation

Table 1 presents the results of all models on the Reuters dataset. Regarding unsupervised semantic hashing, all the NASH variants consistently outperform the baseline methods by a substantial margin, indicating that our model makes the most effective use of unlabeled data and manages to assign similar hashing codes, i.e., with small Hamming distance to each other, to documents that belong to the same label. It can also be observed that the injection of noise into the decoder networks has improved the robustness of the learned binary representations, resulting in better retrieval performance. More importantly, by making the variances of the noise adaptive to the specific input, our NASH-DN achieves even better results compared with NASH-N, highlighting the importance of exploring/learning the trade-off between the rate and distortion objectives from the data itself. We observe the same trend and superiority of our NASH-DN models on the other two benchmarks, as shown in Tables 3 and 4.

Another observation is that the retrieval results tend to drop a bit when we set the length of the hashing codes to 64 or larger, which also happens for some baseline models. This phenomenon has been reported previously in Wang et al. (2012); Liu et al. (2012); Wang et al. (2013); Chaidaroon and Fang (2017), and the reasons could be twofold: (i) for longer codes, the number of data points that are assigned to a certain binary code decreases exponentially. As a result, many queries may fail to return any neighbor documents (Wang et al., 2012); (ii) considering the size of the training data, it is likely that the model may overfit with long hash codes (Chaidaroon and Fang, 2017). However, even with longer hashing codes,
our NASH models perform stronger than the baselines in most cases (except for the 20Newsgroups dataset), suggesting that NASH can effectively allocate documents to informative/meaningful hashing codes even with limited training data.

Word:   weapons   medical    companies   define       israel    book
NASH:   gun       treatment  company     definition   israeli   books
        guns      disease    market      defined      arabs     english
        weapon    drugs      afford      explained    arab      references
        armed     health     products    discussion   jewish    learning
        assault   medicine   money       knowledge    jews      reference
NVDM:   guns      medicine   expensive   defined      israeli   books
        weapon    health     industry    definition   arab      reference
        gun       treatment  company     printf       arabs     guide
        militia   disease    market      int          lebanon   writing
        armed     patients   buy         sufficient   lebanese  pages

Table 2: The five nearest words in the semantic space learned by NASH, compared with the results from NVDM (Miao et al., 2016).

Method        8 bits    16 bits   32 bits   64 bits   128 bits
Unsupervised Hashing
LSH           0.0578    0.0597    0.0666    0.0770    0.0949
S-RBM         0.0594    0.0604    0.0533    0.0623    0.0642
SpH           0.2545    0.3200    0.3709    0.3196    0.2716
STH           0.3664    0.5237    0.5860    0.5806    0.5443
VDSH          0.3643    0.3904    0.4327    0.1731    0.0522
NASH          0.3786    0.5108    0.5671    0.5071    0.4664
NASH-N        0.3903    0.5213    0.5987    0.5143    0.4776
NASH-DN       0.4040    0.5310    0.6225    0.5377    0.4945
Supervised Hashing
KSH           0.4257    0.5559    0.6103    0.6488    0.6638
SHTTM         0.2690    0.3235    0.2357    0.1411    0.1299
VDSH-S        0.6586    0.6791    0.7564    0.6850    0.6916
VDSH-SP       0.6609    0.6551    0.7125    0.7045    0.7117
NASH-DN-S     0.6247    0.6973    0.8069    0.8213    0.7840

Table 3: Precision of the top 100 retrieved documents on the 20Newsgroups dataset.

Method        8 bits    16 bits   32 bits   64 bits   128 bits
Unsupervised Hashing
LSH           0.4388    0.4393    0.4514    0.4553    0.4773
S-RBM         0.4846    0.5108    0.5166    0.5190    0.5137
SpH           0.5807    0.6055    0.6281    0.6143    0.5891
STH           0.3723    0.3947    0.4105    0.4181    0.4123
VDSH          0.4330    0.6853    0.7108    0.4410    0.5847
NASH          0.5849    0.6573    0.6921    0.6548    0.5998
NASH-N        0.6233    0.6759    0.7201    0.6877    0.6314
NASH-DN       0.6358    0.6956    0.7327    0.7010    0.6325
Supervised Hashing
KSH           0.6608    0.6842    0.7047    0.7175    0.7243
SHTTM         0.6299    0.6571    0.6485    0.6893    0.6474
VDSH-S        0.7387    0.7887    0.7883    0.7967    0.8018
VDSH-SP       0.7498    0.7798    0.7891    0.7888    0.7970
NASH-DN-S     0.7438    0.7946    0.7987    0.8014    0.8139

Table 4: Precision of the top 100 retrieved documents on the TMC dataset.

We also evaluate the effectiveness of NASH in a supervised scenario on the Reuters dataset, where the label or topic information is utilized during training. As shown in Figure 2, our NASH-DN-S model consistently outperforms several supervised semantic hashing baselines, with various choices of hashing bits. Notably, our model exhibits higher Top-100 retrieval precision than VDSH-S and VDSH-SP, proposed by Chaidaroon and Fang (2017). This may be attributed to the fact that in the VDSH models, the continuous embeddings are not optimized with their future binarization in mind, and thus could hurt the relevance of the learned binary codes. On the contrary, our model is optimized in an end-to-end manner, where the gradients are directly backpropagated to the inference network (through the binary/discrete latent variable), and thus gives rise to a more robust hash function.

5.2 Ablation study

5.2.1 The effect of stochastic sampling

As described in Section 3, the binary latent variables z in NASH can be either deterministically (via (1)) or stochastically (via (2)) sampled. We compare these two types of binarization functions in the case of unsupervised hashing. As illustrated in Figure 3, stochastic sampling shows stronger retrieval results on all three datasets, indicating that endowing the sampling process of the latent variables with more stochasticity improves the learned representations.

5.2.2 The effect of encoder/decoder networks

Under the variational framework introduced here, the encoder network, i.e., the hash function, and the decoder network are jointly optimized to abstract semantic features from documents. An interesting question concerns what types of network should be leveraged for each part of our NASH model. In this regard, we further investigate the effect of
using an encoder or decoder with different non-linearity, ranging from a linear transformation to two-layer MLPs. We employ a base model with an encoder of two-layer MLPs and a linear decoder (the setup described in Section 3), and the ablation study results are shown in Table 6.

[Figure 3: The precisions of the top 100 retrieved documents for NASH-DN with stochastic or deterministic binary latent variables.]

Network          Encoder   Decoder
linear           0.5844    0.6225
one-layer MLP    0.6187    0.3559
two-layer MLP    0.6225    0.1047

Table 6: Ablation study with different encoder/decoder networks.

It is observed that for the encoder networks, increasing the non-linearity by stacking MLP layers leads to better empirical results. In other words, endowing the hash function with more modeling capacity is advantageous to retrieval tasks. However, when we employ a non-linear network for the decoder, the retrieval precision drops dramatically. It is worth noting that the only difference between the linear transformation and the one-layer MLP is whether a non-linear activation function is employed or not.

This observation may be attributed to the fact that the decoder network can be considered as a similarity measure between the latent variable z and the word embeddings E_k for every word, and the probabilities of the words present in the document are maximized to ensure that z is informative. As a result, if we allow the decoder to be too expressive (e.g., a one-layer MLP), it is likely that we will end up with a very flexible similarity measure but relatively less meaningful binary representations. This finding is consistent with several image hashing methods, such as SGH (Dai et al., 2017) or the binary autoencoder (Carreira-Perpinán and Raziperchikolaei, 2015), where a linear decoder is typically adopted to obtain promising retrieval results. However, our experiments may not speak for other choices of encoder-decoder architectures, e.g., LSTM-based sequence-to-sequence models (Sutskever et al., 2014) or DCNN-based autoencoders (Zhang et al., 2017).

5.3 Qualitative Analysis

5.3.1 Analysis of Semantic Information

To understand what information has been learned in our NASH model, we examine the matrix E in (6). Similar to Miao et al. (2016) and Larochelle and Lauly (2012), we select the 5 nearest words according to the word vectors learned from NASH and compare with the corresponding results from NVDM.

As shown in Table 2, although our NASH model contains a binary latent variable, rather than a continuous one as in NVDM, it also effectively groups semantically-similar words together in the learned vector space. This further demonstrates that the proposed generative framework manages to bypass the binary/discrete constraint and is able to abstract useful semantic information from documents.

Category      Title/Subject                           8-bit code   16-bit code
Baseball      Dave Kingman for the hall of fame       11101001     0010110100000110
              Time of game                            11111001     0010100100000111
              Game score report                       11101001     0010110100000110
              Why is Barry Bonds not batting 4th?     11101101     0011110100000110
Electronics   Building a UV flashlight                10110100     0010001000101011
              How to drive an array of LEDs           10110101     0010001000101001
              2% silver solder                        11010101     0010001000101011
              Subliminal message flashing on TV       10110100     0010011000101001

Table 5: Examples of learned compact hashing codes on the 20Newsgroups dataset.

5.3.2 Case Study

In Table 5, we show some examples of the
learned binary hashing codes on the 20Newsgroups dataset. We observe that for both the 8-bit and 16-bit cases, NASH typically compresses documents with shared topics into very similar binary codes. On the contrary, the hashing codes for documents with different topics exhibit much larger Hamming distance. As a result, relevant documents can be efficiently retrieved by simply computing their Hamming distances.

6 Conclusions

This paper presents a first step towards end-to-end semantic hashing, where the binary/discrete constraints are carefully handled with an effective gradient estimator. A neural variational framework is introduced to train our model. Motivated by the connections between the proposed method and rate-distortion theory, we inject data-dependent noise into the Bernoulli latent variable at the training stage. The effectiveness of our framework is demonstrated with extensive experiments.

Acknowledgments

We would like to thank the ACL reviewers for their insightful suggestions. This research was supported in part by DARPA, DOE, NIH, NSF and ONR.

References

Johannes Ballé, Valero Laparra, and Eero P Simoncelli. 2016. End-to-end optimization of nonlinear transform codes for perceptual quality. In Picture Coding Symposium (PCS), 2016. IEEE, pages 1–5.

Yoshua Bengio, Nicholas Léonard, and Aaron Courville. 2013. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432.

Toby Berger. 1971. Rate-distortion theory. Encyclopedia of Telecommunications.

Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew M Dai, Rafal Jozefowicz, and Samy Bengio. 2015. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349.

Miguel A Carreira-Perpinán and Ramin Raziperchikolaei. 2015. Hashing with binary autoencoders. In Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on. IEEE, pages 557–566.

Suthee Chaidaroon and Yi Fang. 2017. Variational deep semantic hashing for text documents. In Proceedings of the 40th international ACM SIGIR conference on Research and development in information retrieval. ACM.

Ting Chen, Martin Renqiang Min, and Yizhou Sun. 2017. Learning k-way d-dimensional discrete code for compact embedding representations. arXiv preprint arXiv:1711.03067.

Bo Dai, Ruiqi Guo, Sanjiv Kumar, Niao He, and Le Song. 2017. Stochastic generative hashing. arXiv preprint arXiv:1701.02815.

Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S Mirrokni. 2004. Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the twentieth annual symposium on Computational geometry. ACM, pages 253–262.

Geoffrey Hinton. 2012. Neural networks for machine learning, Coursera. URL: http://coursera.org/course/neuralnets.

Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. 2016. Binarized neural networks. In Advances in neural information processing systems. pages 4107–4115.

Eric Jang, Shixiang Gu, and Ben Poole. 2016. Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144.

Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Diederik P Kingma and Max Welling. 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.

Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In Advances in neural information processing systems. pages 3294–3302.

Yehuda Koren. 2008. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pages 426–434.

Hugo Larochelle and Stanislas Lauly. 2012. A neural autoregressive topic model. In Advances in Neural Information Processing Systems. pages 2708–2716.

Michael S Lew, Nicu Sebe, Chabane Djeraba, and Ramesh Jain. 2006. Content-based multimedia information retrieval: State of the art and challenges. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 2(1):1–19.

Qi Li, Zhenan Sun, Ran He, and Tieniu Tan. 2017. Deep supervised discrete hashing. arXiv preprint arXiv:1705.10999.

Wei Liu, Jun Wang, Rongrong Ji, Yu-Gang Jiang, and Shih-Fu Chang. 2012. Supervised hashing with kernels. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, pages 2074–2081.
Chris J Maddison, Andriy Mnih, and Yee Whye Teh. 2016. The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712.

Christopher D Manning, Prabhakar Raghavan, Hinrich Schütze, et al. 2008. Introduction to information retrieval, volume 1. Cambridge University Press, Cambridge.

Yishu Miao, Edward Grefenstette, and Phil Blunsom. 2017. Discovering discrete latent topics with neural variational inference. arXiv preprint arXiv:1706.00359.

Yishu Miao, Lei Yu, and Phil Blunsom. 2016. Neural variational inference for text processing. In International Conference on Machine Learning. pages 1727–1736.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems. pages 3111–3119.

Sandeep Pandey, Andrei Broder, Flavio Chierichetti, Vanja Josifovski, Ravi Kumar, and Sergei Vassilvitskii. 2009. Nearest-neighbor caching for content-match applications. In Proceedings of the 18th international conference on World wide web. ACM, pages 441–450.

Ruslan Salakhutdinov and Geoffrey Hinton. 2009. Semantic hashing. International Journal of Approximate Reasoning 50(7):969–978.

Dinghan Shen, Martin Renqiang Min, Yitong Li, and Lawrence Carin. 2017a. Adaptive convolutional filter generation for natural language understanding. arXiv preprint arXiv:1709.08294.

Dinghan Shen, Guoyin Wang, Wenlin Wang, Martin Renqiang Min, Qinliang Su, Yizhe Zhang, Chunyuan Li, Ricardo Henao, and Lawrence Carin. 2018. Baseline needs more love: On simple word-embedding-based models and associated pooling mechanisms. In ACL.

Dinghan Shen, Yizhe Zhang, Ricardo Henao, Qinliang Su, and Lawrence Carin. 2017b. Deconvolutional latent-variable model for text sequence matching. AAAI.

Raphael Shu and Hideki Nakayama. 2017. Compressing word embeddings via deep compositional code learning. arXiv preprint arXiv:1711.01068.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15(1):1929–1958.

Benno Stein, Sven Meyer zu Eissen, and Martin Potthast. 2007. Strategies for retrieving plagiarized documents. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, pages 825–826.

Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in neural information processing systems. pages 3104–3112.

Lucas Theis, Wenzhe Shi, Andrew Cunningham, and Ferenc Huszár. 2017. Lossy image compression with compressive autoencoders. ICLR.

Aaron van den Oord, Oriol Vinyals, et al. 2017. Neural discrete representation learning. In Advances in Neural Information Processing Systems. pages 6309–6318.

Jingdong Wang, Heng Tao Shen, Jingkuan Song, and Jianqiu Ji. 2014. Hashing for similarity search: A survey. arXiv preprint arXiv:1408.2927.

Jun Wang, Sanjiv Kumar, and Shih-Fu Chang. 2012. Semi-supervised hashing for large-scale search. IEEE Transactions on Pattern Analysis and Machine Intelligence 34(12):2393–2406.

Qifan Wang, Dan Zhang, and Luo Si. 2013. Semantic hashing using tags and topic modeling. In Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval. ACM, pages 213–222.

Wenlin Wang, Zhe Gan, Wenqi Wang, Dinghan Shen, Jiaji Huang, Wei Ping, Sanjeev Satheesh, and Lawrence Carin. 2018. Topic compositional neural language model. In AISTATS.

Yair Weiss, Antonio Torralba, and Rob Fergus. 2009. Spectral hashing. In Advances in neural information processing systems. pages 1753–1760.

Jiaming Xu, Peng Wang, Guanhua Tian, Bo Xu, Jun Zhao, Fangyuan Wang, and Hongwei Hao. 2015. Convolutional neural networks for text hashing. In IJCAI. pages 1369–1375.

Zichao Yang, Zhiting Hu, Ruslan Salakhutdinov, and Taylor Berg-Kirkpatrick. 2017. Improved variational autoencoders for text modeling using dilated convolutions. arXiv preprint arXiv:1702.08139.

Dell Zhang, Jun Wang, Deng Cai, and Jinsong Lu. 2010. Self-taught hashing for fast similarity search. In Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval. ACM, pages 18–25.

Yizhe Zhang, Dinghan Shen, Guoyin Wang, Zhe Gan, Ricardo Henao, and Lawrence Carin. 2017. Deconvolutional paragraph representation learning. In Advances in Neural Information Processing Systems. pages 4172–4182.
