NASH: End-to-End Neural Architecture for Semantic Hashing

∗ Equal contribution.
Abstract

Semantic hashing has become a powerful paradigm for fast similarity search in many information retrieval systems. While fairly successful, previous techniques generally require two-stage training, and the binary constraints are handled ad-hoc. In this paper, we present an end-to-end Neural Architecture for Semantic Hashing (NASH), where the binary hashing codes are treated as Bernoulli latent variables. A neural variational inference framework is proposed for training, where gradients are directly backpropagated through the discrete latent variable to optimize the hash function. We also draw connections between the proposed method and rate-distortion theory, which provides a theoretical foundation for the effectiveness of the proposed framework. Experimental results on three public datasets demonstrate that our method significantly outperforms several state-of-the-art models in both unsupervised and supervised scenarios.

1 Introduction

The problem of similarity search, also called nearest-neighbor search, consists of finding documents from a large collection of documents, or corpus, that are most similar to a query document of interest. Fast and accurate similarity search is at the core of many information retrieval applications, such as plagiarism analysis (Stein et al., 2007), collaborative filtering (Koren, 2008), content-based multimedia retrieval (Lew et al., 2006) and caching (Pandey et al., 2009). Semantic hashing is an effective approach for fast similarity search (Salakhutdinov and Hinton, 2009; Zhang et al., 2010): each document is represented by a compact binary hashing code, so that the similarity between two documents can be evaluated by simply calculating the pairwise Hamming distance between their hashing codes, i.e., the number of bits that are different between the two codes. Given that today an ordinary PC is able to execute millions of Hamming distance computations in just a few milliseconds (Zhang et al., 2010), this semantic hashing strategy is very computationally attractive.

While considerable research has been devoted to text (semantic) hashing, existing approaches typically require two-stage training procedures. These methods can generally be divided into two categories: (i) binary codes for documents are first learned in an unsupervised manner, and then l binary classifiers are trained via supervised learning to predict the l-bit hashing code (Zhang et al., 2010; Xu et al., 2015); (ii) continuous text representations are first inferred, and are binarized as a second (separate) step during testing (Wang et al., 2013; Chaidaroon and Fang, 2017). Because the model parameters are not learned in an end-to-end manner, these two-stage training strategies may result in suboptimal local optima. This happens because different modules within the model are optimized separately, preventing the sharing of information between them. Further, in existing methods the binary constraints are typically handled ad-hoc by truncation, i.e., the hashing codes are obtained via direct binarization of the continuous representations after training. As a result, the information contained in the continuous representations is lost during the (separate) binarization process. Moreover, training different modules (mapping and classifier/binarization) separately often requires additional hyperparameter tuning for each training stage, which can be laborious and time-consuming.
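As a concrete illustration of the Hamming-distance computation mentioned above, the short sketch below compares two binary codes stored as plain Python integers; it is an illustrative example with made-up codes, not part of the proposed model.

```python
def hamming_distance(code_a: int, code_b: int) -> int:
    """Number of differing bits between two hashing codes stored as integers."""
    return bin(code_a ^ code_b).count("1")

# Two hypothetical 8-bit hashing codes.
doc_a = 0b11101001
doc_b = 0b11101101
print(hamming_distance(doc_a, doc_b))  # -> 1 (the codes differ in a single bit)
```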
In this paper, we propose a simple and generic neural architecture for text hashing that learns binary latent codes for documents in an end-to-end manner, where the binary (hashing) codes are represented as either deterministic or stochastic Bernoulli latent variables. The inference (encoder) and generative (decoder) networks are optimized jointly by maximizing a variational lower bound to the marginal distribution of the input documents (corpus). By leveraging a simple and effective method to estimate the gradients with respect to the discrete (binary) variables, the loss term from the generative (decoder) network can be directly backpropagated into the inference (encoder) network to optimize the hash function.

[Figure 1: NASH for end-to-end semantic hashing. The inference network maps x → z = gφ(x) using an MLP, and the generative network recovers x as z → x̂; the figure also depicts the data-dependent noise z′ ∼ N(z, σ²I), whose log-variance log σ² is produced by an MLP.]

Motivated by rate-distortion theory (Berger, 1971; Theis et al., 2017), we propose to inject data-dependent noise into the latent codes during the decoding stage, which adaptively accounts for the tradeoff between minimizing the rate (the number of bits used, or effective code length) and the distortion (reconstruction error) during training. The connection between the proposed method and rate-distortion theory is further elucidated, providing a theoretical foundation for the effectiveness of our framework.

Summarizing, the contributions of this paper are: (i) to the best of our knowledge, we present the first semantic hashing architecture that can be trained in an end-to-end manner; (ii) we propose a neural variational inference framework to learn compact (regularized) binary codes for documents, achieving promising results on both unsupervised and supervised text hashing; (iii) the connection between our method and rate-distortion theory is established, from which we demonstrate the advantage of injecting data-dependent noise into the latent variable during training.

2 Related Work

van den Oord et al. (2017) combined VAEs with vector quantization to learn discrete latent representations, and demonstrated the utility of these learned representations on image, video, and speech data. Li et al. (2017) leveraged both pairwise label and classification information to learn discrete hash codes, which exhibit state-of-the-art performance on image retrieval tasks.

For natural language processing (NLP), although significant progress has been made in learning continuous deep representations for words or documents (Mikolov et al., 2013; Kiros et al., 2015; Shen et al., 2018), discrete neural representations have mainly been explored for learning word embeddings (Shu and Nakayama, 2017; Chen et al., 2017). In these recent works, words are represented as vectors of discrete numbers, which are very efficient storage-wise while showing comparable performance on several NLP tasks relative to continuous word embeddings. However, discrete representations that are learned in an end-to-end manner at the sentence or document level have rarely been explored, and there is a lack of rigorous evaluation of their effectiveness. Our work focuses on learning discrete (binary) representations for text documents. Further, we employ semantic hashing (fast similarity search) as a mechanism to evaluate the quality of the learned binary latent codes.
Gradients with respect to the binary latent variables are estimated with the straight-through (ST) estimator (Bengio et al., 2013; Hubara et al., 2016; Theis et al., 2017). With the ST gradient estimator, the first loss term in (3) can be backpropagated into the encoder network to fine-tune the hash function gφ(x).

For the approximate generator qθ(x|z) in (3), let x_i denote the one-hot representation of the i-th word within a document. Note that x = Σ_i x_i is thus the bag-of-words representation of document x. To reconstruct the input x from z, we utilize a softmax decoding function written as:

q(x_i = w \mid z) = \frac{\exp(z^{T} E x_w + b_w)}{\sum_{j=1}^{|V|} \exp(z^{T} E x_j + b_j)}, \qquad (6)

where q(x_i = w|z) is the probability that x_i is word w ∈ V, qθ(x|z) = Π_i q(x_i = w|z), and θ = {E, b_1, ..., b_{|V|}}. Note that E ∈ R^{d×|V|} can be interpreted as a word embedding matrix to be learned, and {b_i}_{i=1}^{|V|} denote bias terms. Intuitively, the objective in (6) encourages the discrete vector z to be close to the embeddings of every word that appears in the document.
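To make the encoding/decoding pipeline concrete, the following PyTorch-style sketch implements a straight-through Bernoulli binarization together with the softmax decoder of Eq. (6). It is a minimal illustration under assumed shapes and names (the class and variables are ours, not the authors' released code), and it omits the KL/prior term of the variational objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NashSketch(nn.Module):
    """Illustrative NASH-style encoder/decoder (hyperparameters are assumptions)."""

    def __init__(self, vocab_size: int, num_bits: int, hidden: int = 500):
        super().__init__()
        # Inference (encoder) network g_phi: TFIDF/bag-of-words input -> bit logits.
        self.encoder = nn.Sequential(
            nn.Linear(vocab_size, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_bits))
        # Decoder parameters of Eq. (6): word embedding matrix E (d x |V|) and biases b.
        self.E = nn.Parameter(0.01 * torch.randn(num_bits, vocab_size))
        self.b = nn.Parameter(torch.zeros(vocab_size))

    def binarize(self, probs: torch.Tensor, stochastic: bool = True) -> torch.Tensor:
        # Straight-through estimator: the forward pass uses hard 0/1 codes, while the
        # backward pass treats binarization as the identity, so gradients reach g_phi.
        hard = torch.bernoulli(probs) if stochastic else (probs > 0.5).float()
        return probs + (hard - probs).detach()

    def forward(self, x_bow: torch.Tensor):
        probs = torch.sigmoid(self.encoder(x_bow))   # Bernoulli parameter for each bit
        z = self.binarize(probs)                     # binary hashing code
        logits = z @ self.E + self.b                 # z^T E x_w + b_w for every word w
        log_q = F.log_softmax(logits, dim=-1)        # log q(x_i = w | z), as in Eq. (6)
        # Word-level reconstruction loss, weighted by the bag-of-words counts.
        rec_loss = -(x_bow * log_q).sum(dim=-1).mean()
        return rec_loss, z
```

The hashing code for retrieval can then be read off as z (or as probs > 0.5 at test time), and rec_loss corresponds to the first loss term discussed above.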
Following the rate-distortion perspective discussed in the introduction, the training objective can be viewed as trading off rate against distortion, where rate denotes the effective code length, i.e., the number of bits used, and distortion denotes the reconstruction error introduced by the encoding/decoding sequence. Further, x̂ is the reconstructed input and β is a hyperparameter that controls the tradeoff between the two terms.

Consider the case where we have a Bernoulli prior on z, p(z) ∼ Bernoulli(γ), and x is conditionally drawn from a Gaussian distribution, p(x|z) ∼ N(Ez, σ²I). Here E = {e_i}_{i=1}^{|V|}, where e_i ∈ R^d can be interpreted as a codebook with |V| codewords; in our case, E corresponds to the word embedding matrix in (6).

For the case of a stochastic latent variable z, the objective function in (3) can be written in a form similar to the rate-distortion tradeoff:

\min \; \mathbb{E}_{q_\phi(z|x)} \Big[ \underbrace{-\log q_\phi(z|x)}_{\text{Rate}} \;+\; \underbrace{\tfrac{1}{2\sigma^{2}}}_{\beta} \, \underbrace{\lVert x - Ez \rVert_{2}^{2}}_{\text{Distortion}} \;+\; C \Big], \qquad (8)

where C is a constant that encapsulates the prior distribution p(z) and the Gaussian normalization term.
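The weight β = 1/(2σ²) in (8) follows from the standard expansion of the Gaussian likelihood assumed for p(x|z); the short derivation below is added here for clarity and is not part of the original text. For an n-dimensional x,

-\log p(x \mid z) = \frac{1}{2\sigma^{2}} \, \lVert x - Ez \rVert_{2}^{2} + \frac{n}{2}\log\!\big(2\pi\sigma^{2}\big),

so the squared-error (distortion) term enters the objective with weight β = 1/(2σ²), while the normalization term, together with the contribution of the Bernoulli prior, is absorbed into the constant C.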
Notably, the trade-off hyperparameter β = 1/(2σ²) is closely related to the variance of the distribution p(x|z). In other words, by controlling the variance σ², the model can adaptively explore different trade-offs between the rate and distortion objectives. However, the optimal trade-off may differ from sample to sample.

Inspired by the observations above, we propose to inject data-dependent noise into the latent variable z, rather than setting the variance term σ² to a fixed value (Dai et al., 2017; Theis et al., 2017). Specifically, log σ² is obtained via a one-layer MLP transformation of gφ(x). Afterwards, we sample z′ from N(z, σ²I), which then replaces z in (6) to infer the probabilities of generating individual words (as shown in Figure 1). As a result, the variance is different for every input document x, and the model is thus provided with additional flexibility to explore various trade-offs between rate and distortion for different training observations. Although our decoder in (6) is not strictly a Gaussian distribution, we found empirically that injecting data-dependent noise into z yields strong retrieval results; see Section 5.1.
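A minimal sketch of this data-dependent noise injection, extending the decoder sketch shown earlier (again with assumed names, and assuming per-bit variances, which the text does not specify), could look as follows:

```python
import torch
import torch.nn as nn

class DataDependentNoise(nn.Module):
    """Predicts log sigma^2 from the encoder features and perturbs the code z."""

    def __init__(self, feature_dim: int, num_bits: int):
        super().__init__()
        # One-layer MLP producing log sigma^2 from g_phi(x).
        self.log_var = nn.Linear(feature_dim, num_bits)

    def forward(self, enc_features: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        sigma = torch.exp(0.5 * self.log_var(enc_features))  # sigma = exp(log sigma^2 / 2)
        z_noisy = z + sigma * torch.randn_like(z)             # z' ~ N(z, sigma^2 I)
        return z_noisy  # z' replaces z in the softmax decoder of Eq. (6)
```

During training, z_noisy would be fed to the decoder in place of z; at test time the clean binary code z is used as the hash.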
3.4 Supervised Hashing

The proposed Neural Architecture for Semantic Hashing (NASH) can be extended to supervised hashing, where a mapping from the latent variable z to the labels y is learned, parametrized here by a two-layer MLP followed by a fully-connected softmax layer. To allow the model to explore and balance between maximizing the variational lower bound in (3) and minimizing the discriminative loss, the following joint training objective is employed:

\mathcal{L} = -\mathcal{L}_{vae}(\theta, \phi; x) + \alpha \, \mathcal{L}_{dis}(\eta; z, y), \qquad (9)

where η refers to the parameters of the MLP classifier and α controls the relative weight between the variational lower bound (L_vae) and the discriminative loss (L_dis), defined as the cross-entropy loss. The parameters {θ, φ, η} are learned end-to-end via Monte Carlo estimation.
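A sketch of the joint objective in Eq. (9), reusing the hypothetical modules from the earlier sketches (the classifier width and the value of alpha below are illustrative assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LabelClassifier(nn.Module):
    """Two-layer MLP followed by a softmax output layer, mapping the code z to labels y."""

    def __init__(self, num_bits: int, num_classes: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_bits, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_classes))

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.net(z)  # logits; the softmax is folded into the cross-entropy below

def joint_loss(neg_elbo: torch.Tensor, class_logits: torch.Tensor,
               labels: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    # L = -L_vae + alpha * L_dis, with L_dis the cross-entropy loss (Eq. (9)).
    # Here `neg_elbo` already denotes -L_vae, e.g. the reconstruction (+ KL) loss.
    return neg_elbo + alpha * F.cross_entropy(class_logits, labels)
```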
4 Experimental Setup

4.1 Datasets

We use the following three standard, publicly available datasets for training and evaluation: (i) Reuters21578, containing 10,788 news documents, which have been classified into 90 different categories; (ii) 20Newsgroups, a collection of 18,828 newsgroup documents, which are categorized into 20 different topics; and (iii) TMC (from the SIAM text mining competition), containing air traffic reports provided by NASA and consisting of 21,519 training documents divided into 22 different categories. To allow direct comparison with prior work, we employ the TFIDF features for these datasets supplied by Chaidaroon and Fang (2017), where the vocabulary sizes for the three datasets are set to 10,000, 7,164 and 20,000, respectively.

4.2 Training Details

For the inference network, we employ a feed-forward neural network with two hidden layers (both with 500 units) and the ReLU activation function, which transforms the input documents, i.e., TFIDF features in our experiments, into a continuous representation. Empirically, we found that stochastic binarization as in (2) shows stronger performance than deterministic binarization, and we thus use the former in our experiments; a systematic ablation study comparing the two binarization strategies is presented in Section 5.2.

Our model is trained using Adam (Kingma and Ba, 2014), with a learning rate of 1 × 10⁻³ for all parameters. We decay the learning rate by a factor of 0.96 every 10,000 iterations. Dropout (Srivastava et al., 2014) is employed on the output of the encoder network, with the rate selected from {0.7, 0.8, 0.9} on the validation set. To facilitate comparisons with previous methods, we set the dimension of z, i.e., the number of bits within the hashing code, to 8, 16, 32, 64, or 128.
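The optimization setup of Section 4.2 could be written, in PyTorch-style pseudocode reusing the hypothetical NashSketch module from the earlier sketch, roughly as:

```python
import torch

model = NashSketch(vocab_size=10000, num_bits=32)            # assumed model definition
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)     # Adam, learning rate 1e-3
# Decay the learning rate by a factor of 0.96 every 10,000 iterations.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10_000, gamma=0.96)

train_batches = [torch.rand(64, 10000) for _ in range(5)]     # stand-in for real TFIDF batches

for x_bow in train_batches:
    loss, z = model(x_bow)        # dropout on the encoder output is omitted in this sketch
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()              # stepping once per iteration matches the per-iteration decay
```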
4.3 Baselines

We evaluate the effectiveness of our framework on both unsupervised and supervised semantic hashing tasks. We consider the following unsupervised baselines for comparison: Locality Sensitive Hashing (LSH) (Datar et al., 2004), Stacked Restricted Boltzmann Machines (S-RBM) (Salakhutdinov and Hinton, 2009), Spectral Hashing (SpH) (Weiss et al., 2009), Self-taught Hashing (STH) (Zhang et al., 2010) and Variational Deep Semantic Hashing (VDSH) (Chaidaroon and Fang, 2017).

Method     8 bits   16 bits  32 bits  64 bits  128 bits
LSH        0.2802   0.3215   0.3862   0.4667   0.5194
S-RBM      0.5113   0.5740   0.6154   0.6177   0.6452
SpH        0.6080   0.6340   0.6513   0.6290   0.6045
STH        0.6616   0.7351   0.7554   0.7350   0.6986
VDSH       0.6859   0.7165   0.7753   0.7456   0.7318
NASH       0.7113   0.7624   0.7993   0.7812   0.7559
NASH-N     0.7352   0.7904   0.8297   0.8086   0.7867
NASH-DN    0.7470   0.8013   0.8418   0.8297   0.7924

Table 1: Precision of the top 100 retrieved documents on the Reuters dataset (unsupervised hashing).

[Figure 2: Precision (%) of the top 100 retrieved documents on the Reuters dataset (supervised hashing), plotted against the number of bits (8, 16, 32, 64, 128), compared with other supervised baselines (KSH, SHTTM, VDSH-S, VDSH-SP, NASH-DN-S).]
For supervised semantic hashing, we also compare NASH against a number of baselines: Supervised Hashing with Kernels (KSH) (Liu et al., 2012), Semantic Hashing using Tags and Topic Modeling (SHTTM) (Wang et al., 2013) and Supervised VDSH (Chaidaroon and Fang, 2017). It is worth noting that, unlike all of these baselines, our NASH model is trained end-to-end in one step.

4.4 Evaluation Metrics

To evaluate the hashing codes for similarity search, we consider each document in the testing set as a query document. Similar documents to the query then need to be retrieved from the corresponding training set based on the Hamming distance between their hashing codes, i.e., the number of different bits. To facilitate comparison with prior work (Wang et al., 2013; Chaidaroon and Fang, 2017), performance is measured with precision. Specifically, during testing, for each query document we first retrieve the 100 nearest/closest documents according to the Hamming distances of the corresponding hash codes (i.e., the number of different bits). We then examine the percentage of documents among these 100 retrieved ones that belong to the same label (topic) as the query document (we consider documents having the same label as relevant pairs). The ratio of the number of relevant documents to the number of retrieved documents (fixed at 100) is calculated as the precision score. The precision scores are further averaged over all test (query) documents.
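The evaluation protocol above can be summarized by the following sketch, which computes top-100 precision with brute-force Hamming distances (a simplified illustration with assumed NumPy inputs, not the authors' evaluation code):

```python
import numpy as np

def top100_precision(query_codes: np.ndarray, query_labels: np.ndarray,
                     train_codes: np.ndarray, train_labels: np.ndarray,
                     k: int = 100) -> float:
    """Average precision of the top-k retrieved training documents per query.

    Codes are binary {0, 1} arrays of shape (num_docs, num_bits).
    """
    precisions = []
    for code, label in zip(query_codes, query_labels):
        # Hamming distance = number of differing bits between the query and every training code.
        dists = np.count_nonzero(train_codes != code, axis=1)
        nearest = np.argsort(dists, kind="stable")[:k]
        # Fraction of retrieved documents sharing the query's label (relevant pairs).
        precisions.append(np.mean(train_labels[nearest] == label))
    return float(np.mean(precisions))
```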
5 Experimental Results

We experimented with four variants of our NASH model: (i) NASH: with a deterministic decoder; (ii) NASH-N: with fixed random noise injected into the decoder; (iii) NASH-DN: with data-dependent noise injected into the decoder; (iv) NASH-DN-S: NASH-DN with supervised information during training.

5.1 Semantic Hashing Evaluation

Table 1 presents the results of all models on the Reuters dataset. Regarding unsupervised semantic hashing, all the NASH variants consistently outperform the baseline methods by a substantial margin, indicating that our model makes the most effective use of unlabeled data and manages to assign similar hashing codes, i.e., codes with small Hamming distance to each other, to documents that belong to the same label. It can also be observed that the injection of noise into the decoder network improves the robustness of the learned binary representations, resulting in better retrieval performance. More importantly, by making the variances of the noise adaptive to the specific input, our NASH-DN achieves even better results than NASH-N, highlighting the importance of exploring/learning the trade-off between the rate and distortion objectives from the data itself. We observe the same trend and superiority of our NASH-DN models on the other two benchmarks, as shown in Tables 3 and 4.

Another observation is that the retrieval results tend to drop a bit when the length of the hashing codes is set to 64 or larger, which also happens for some baseline models. This phenomenon has been reported previously in Wang et al. (2012); Liu et al. (2012); Wang et al. (2013); Chaidaroon and Fang (2017), and the reasons could be twofold: (i) for longer codes, the number of data points that are assigned to a certain binary code decreases exponentially, so many queries may fail to return any neighbor documents (Wang et al., 2012); (ii) considering the size of the training data, it is likely that the model may overfit with long hash codes (Chaidaroon and Fang, 2017). However, even with longer hashing codes, our NASH models perform better than the baselines in most cases (except for the 20Newsgroups dataset), suggesting that NASH can effectively allocate documents to informative/meaningful hashing codes even with limited training data.
We also evaluate the effectiveness of NASH in a supervised scenario on the Reuters dataset, where the label or topic information is utilized during training. As shown in Figure 2, our NASH-DN-S model consistently outperforms several supervised semantic hashing baselines across various choices of hashing bits. Notably, our model exhibits higher top-100 retrieval precision than VDSH-S and VDSH-SP, proposed by Chaidaroon and Fang (2017). This may be attributed to the fact that in the VDSH models the continuous embeddings are not optimized with their future binarization in mind, which could hurt the relevance of the learned binary codes. In contrast, our model is optimized in an end-to-end manner, where the gradients are directly backpropagated to the inference network (through the binary/discrete latent variable), and thus gives rise to a more robust hash function.

Word    weapons   medical    companies   define       israel    book
NASH    gun       treatment  company     definition   israeli   books
        guns      disease    market      defined      arabs     english
        weapon    drugs      afford      explained    arab      references
        armed     health     products    discussion   jewish    learning
        assault   medicine   money       knowledge    jews      reference
NVDM    guns      medicine   expensive   defined      israeli   books
        weapon    health     industry    definition   arab      reference
        gun       treatment  company     printf       arabs     guide
        militia   disease    market      int          lebanon   writing
        armed     patients   buy         sufficient   lebanese  pages

Table 2: The five nearest words in the semantic space learned by NASH, compared with the results from NVDM (Miao et al., 2016).

Method       8 bits   16 bits  32 bits  64 bits  128 bits
Unsupervised Hashing
LSH          0.0578   0.0597   0.0666   0.0770   0.0949
S-RBM        0.0594   0.0604   0.0533   0.0623   0.0642
SpH          0.2545   0.3200   0.3709   0.3196   0.2716
STH          0.3664   0.5237   0.5860   0.5806   0.5443
VDSH         0.3643   0.3904   0.4327   0.1731   0.0522
NASH         0.3786   0.5108   0.5671   0.5071   0.4664
NASH-N       0.3903   0.5213   0.5987   0.5143   0.4776
NASH-DN      0.4040   0.5310   0.6225   0.5377   0.4945
Supervised Hashing
KSH          0.4257   0.5559   0.6103   0.6488   0.6638
SHTTM        0.2690   0.3235   0.2357   0.1411   0.1299
VDSH-S       0.6586   0.6791   0.7564   0.6850   0.6916
VDSH-SP      0.6609   0.6551   0.7125   0.7045   0.7117
NASH-DN-S    0.6247   0.6973   0.8069   0.8213   0.7840

Table 3: Precision of the top 100 retrieved documents on the 20Newsgroups dataset.

Method       8 bits   16 bits  32 bits  64 bits  128 bits
Unsupervised Hashing
LSH          0.4388   0.4393   0.4514   0.4553   0.4773
S-RBM        0.4846   0.5108   0.5166   0.5190   0.5137
SpH          0.5807   0.6055   0.6281   0.6143   0.5891
STH          0.3723   0.3947   0.4105   0.4181   0.4123
VDSH         0.4330   0.6853   0.7108   0.4410   0.5847
NASH         0.5849   0.6573   0.6921   0.6548   0.5998
NASH-N       0.6233   0.6759   0.7201   0.6877   0.6314
NASH-DN      0.6358   0.6956   0.7327   0.7010   0.6325
Supervised Hashing
KSH          0.6608   0.6842   0.7047   0.7175   0.7243
SHTTM        0.6299   0.6571   0.6485   0.6893   0.6474
VDSH-S       0.7387   0.7887   0.7883   0.7967   0.8018
VDSH-SP      0.7498   0.7798   0.7891   0.7888   0.7970
NASH-DN-S    0.7438   0.7946   0.7987   0.8014   0.8139

Table 4: Precision of the top 100 retrieved documents on the TMC dataset.

5.2 Ablation study

5.2.1 The effect of stochastic sampling

As described in Section 3, the binary latent variables z in NASH can be sampled either deterministically (via (1)) or stochastically (via (2)). We compare these two types of binarization functions in the case of unsupervised hashing. As illustrated in Figure 3, stochastic sampling shows stronger retrieval results on all three datasets, indicating that endowing the sampling process of the latent variables with more stochasticity improves the learned representations.

5.2.2 The effect of encoder/decoder networks

Under the variational framework introduced here, the encoder network, i.e., the hash function, and the decoder network are jointly optimized to abstract semantic features from documents. An interesting question concerns what types of network should be leveraged for each part of our NASH model. In this regard, we further investigate the effect of different choices of encoder/decoder networks.
Category      Title/Subject                          8-bit code   16-bit code
Baseball      Dave Kingman for the hall of fame      11101001     0010110100000110
              Time of game                           11111001     0010100100000111
              Game score report                      11101001     0010110100000110
              Why is Barry Bonds not batting 4th?    11101101     0011110100000110
Electronics   Building a UV flashlight               10110100     0010001000101011
              How to drive an array of LEDs          10110101     0010001000101001
              2% silver solder                       11010101     0010001000101011
              Subliminal message flashing on TV      10110100     0010011000101001
References

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Sandeep Pandey, Andrei Broder, Flavio Chierichetti, Vanja Josifovski, Ravi Kumar, and Sergei Vassilvitskii. 2009. Nearest-neighbor caching for content-match applications. In Proceedings of the 18th International Conference on World Wide Web, ACM, pages 441–450.

Ruslan Salakhutdinov and Geoffrey Hinton. 2009. Semantic hashing. International Journal of Approximate Reasoning 50(7):969–978.

Dinghan Shen, Martin Renqiang Min, Yitong Li, and Lawrence Carin. 2017a. Adaptive convolutional filter generation for natural language understanding. arXiv preprint arXiv:1709.08294.

Dinghan Shen, Guoyin Wang, Wenlin Wang, Martin Renqiang Min, Qinliang Su, Yizhe Zhang, Chunyuan Li, Ricardo Henao, and Lawrence Carin. 2018. Baseline needs more love: On simple word-embedding-based models and associated pooling mechanisms. In ACL.

Dinghan Shen, Yizhe Zhang, Ricardo Henao, Qinliang Su, and Lawrence Carin. 2017b. Deconvolutional latent-variable model for text sequence matching. In AAAI.

Raphael Shu and Hideki Nakayama. 2017. Compressing word embeddings via deep compositional code learning. arXiv preprint arXiv:1711.01068.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15(1):1929–1958.

Jingdong Wang, Heng Tao Shen, Jingkuan Song, and Jianqiu Ji. 2014. Hashing for similarity search: A survey. arXiv preprint arXiv:1408.2927.

Jun Wang, Sanjiv Kumar, and Shih-Fu Chang. 2012. Semi-supervised hashing for large-scale search. IEEE Transactions on Pattern Analysis and Machine Intelligence 34(12):2393–2406.

Qifan Wang, Dan Zhang, and Luo Si. 2013. Semantic hashing using tags and topic modeling. In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, pages 213–222.

Wenlin Wang, Zhe Gan, Wenqi Wang, Dinghan Shen, Jiaji Huang, Wei Ping, Sanjeev Satheesh, and Lawrence Carin. 2018. Topic compositional neural language model. In AISTATS.

Yair Weiss, Antonio Torralba, and Rob Fergus. 2009. Spectral hashing. In Advances in Neural Information Processing Systems, pages 1753–1760.

Jiaming Xu, Peng Wang, Guanhua Tian, Bo Xu, Jun Zhao, Fangyuan Wang, and Hongwei Hao. 2015. Convolutional neural networks for text hashing. In IJCAI, pages 1369–1375.

Zichao Yang, Zhiting Hu, Ruslan Salakhutdinov, and Taylor Berg-Kirkpatrick. 2017. Improved variational autoencoders for text modeling using dilated convolutions. arXiv preprint arXiv:1702.08139.

Dell Zhang, Jun Wang, Deng Cai, and Jinsong Lu. 2010. Self-taught hashing for fast similarity search. In Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, pages 18–25.

Yizhe Zhang, Dinghan Shen, Guoyin Wang, Zhe Gan, Ricardo Henao, and Lawrence Carin. 2017. Deconvolutional paragraph representation learning. In Advances in Neural Information Processing Systems, pages 4172–4182.