Deep Learning of Semantic Word Representations to Implement a Content-Based Recommender for the RecSys Challenge '14
1 Introduction
Latent representation of text is an important task in content-based recommender
systems. Cold-start problems in particular impose a need to rely on the content to
infer accurate recommendations when few (or no) user ratings are provided.
Beyond the n-gram and bag-of-words models used to represent text, continuous
representation techniques such as Latent Dirichlet Allocation (LDA), Latent
Semantic Analysis (LSA), and Principal Component Analysis (PCA) have been used
to describe the content of a document as a probability distribution over latent
variables known as topics.
The idea is that a sparse matrix $M$ that characterizes the preferences of users
(rows) for items (columns) can be factorized into two matrices $U$ and $V$ in a joint
latent factor space of dimensionality $K$. This way the preference of user $u$ for item $v$
can be approximated by the dot product $u^T v$. This method is known as matrix
factorization and proved effective in the Netflix Prize competition, combining
better scalability and predictive accuracy than Collaborative Filtering
methods [1]. We have followed a similar approach, but have not assumed a random
initialization for the matrix $V$. Our hypothesis is that the features describing a
document can be learned in an unsupervised way by considering how words form
sentences in a document. This provides a context for each word, so that words are
not independent of each other as in the bag-of-words model.
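A minimal, illustrative sketch (not the system's actual code) of the dot-product prediction $u^T v$ and one stochastic gradient descent update of the user factors; the learning rate `lr` and regularization `reg` are assumed hyperparameters:

```python
import numpy as np

def predict(U, V, u, v):
    """Approximate the preference of user u for item v as the dot product u^T v."""
    return U[u].dot(V[v])

def sgd_step(U, V, u, v, rating, lr=0.01, reg=0.1):
    """One stochastic gradient descent update on the user factors for an
    observed (user, item, rating) triple, with an L2 penalty on U[u]."""
    err = rating - predict(U, V, u, v)
    U[u] += lr * (err * V[v] - reg * U[u])
    return err

# toy usage: K latent factors, random initialization of U (V would instead be
# initialized from the learned document features described in Section 2)
rng = np.random.default_rng(0)
K, n_users, n_items = 10, 5, 8
U = rng.normal(scale=0.1, size=(n_users, K))
V = rng.normal(scale=0.1, size=(n_items, K))
sgd_step(U, V, u=0, v=3, rating=4.0)
```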
2 Approach
Our method consists of two steps: a) feature learning, and b) user preference
learning. While the second step is the traditional matrix factorization with
stochastic gradient descent, the feature learning step uses a neural network to
learn a continuous representation of words according to their context in a sentence.
Google has presented Word2Vec as a similar deep-learning approach to model
semantic word representations¹, but the task of recommendation on top of this
representation is still unexplored.
¹ https://code.google.com/p/word2vec/
The network is trained by stochastic gradient descent using the back-propagation
algorithm as in [5]. The hidden and output states are computed as follows:
$$s_j(t) = f\left(\sum_i w_i(t)\, U_{ji} + \sum_l s_l(t-1)\, W_{jl}\right)$$

$$y_k(t) = g\left(\sum_j s_j(t)\, V_{kj}\right)$$
where $f(x)$ and $g(x)$ are the sigmoid and softmax activation functions:
$$f(x) = \frac{1}{1 + e^{-x}}, \qquad g(x_m) = \frac{e^{x_m}}{\sum_k e^{x_k}}$$
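For concreteness, the following NumPy sketch implements this forward pass under the assumption of a 1-of-$V$ input encoding; variable names and shapes are illustrative, not taken from the implementation in [5]:

```python
import numpy as np

def sigmoid(x):
    """f(x) = 1 / (1 + e^{-x})"""
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    """g(x_m) = e^{x_m} / sum_k e^{x_k} (shifted for numerical stability)"""
    e = np.exp(x - x.max())
    return e / e.sum()

def rnn_step(w_t, s_prev, U, W, V):
    """One forward step of the recurrent network.

    w_t    : current word as a 1-of-V vector, shape (vocab,)
    s_prev : previous hidden state s(t-1), shape (hidden,)
    U      : input-to-hidden weights, shape (hidden, vocab)
    W      : hidden-to-hidden weights, shape (hidden, hidden)
    V      : hidden-to-output weights, shape (vocab, hidden)
    Returns s(t) and y(t); after training, y(t) approximates
    P(w_{t+1} | w_t, s(t-1)).
    """
    s_t = sigmoid(U @ w_t + W @ s_prev)   # s_j(t) = f(sum_i w_i(t) U_ji + sum_l s_l(t-1) W_jl)
    y_t = softmax(V @ s_t)                # y_k(t) = g(sum_j s_j(t) V_kj)
    return s_t, y_t
```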
When the RNN is trained, the output layer $y(t)$ contains $P(w_{t+1} \mid w_t, s(t-1))$,
the probability distribution of a word given the history of words (context) stored
at time $t-1$. To obtain a per-document topic distribution, we match the empirical
distribution of words in a document $d$ by using a continuous distribution over
these words indexed by a random variable $\theta$:
$$P(d) = \int_\theta P(d, \theta) = \int_\theta P(\theta) \prod_{i=1}^{N} P(w_i \mid \theta)\, d\theta$$
$$u_i = (V^T V)^{-1} V^T y_{u_i}$$
and then predict preference values for new documents within V .
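A minimal sketch of this closed-form step, assuming $V$ holds the learned document features of the items a user rated and $y_{u_i}$ the corresponding observed preference values (the optional `reg` term is an assumption anticipating the regularization discussed later):

```python
import numpy as np

def fit_user_preferences(V_rated, y_u, reg=0.0):
    """Closed-form solution u_i = (V^T V + reg*I)^{-1} V^T y_{u_i}.

    V_rated : (n_rated, K) feature rows of the documents the user rated
    y_u     : (n_rated,)   the user's observed preference values
    reg     : optional ridge term (assumed; 0 gives the plain normal equation)
    """
    K = V_rated.shape[1]
    A = V_rated.T @ V_rated + reg * np.eye(K)
    b = V_rated.T @ y_u
    return np.linalg.solve(A, b)   # solve is more stable than an explicit inverse

def predict_new(u_i, V_new):
    """Predict preference values for new documents as dot products with u_i."""
    return V_new @ u_i
```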
4 Results
Given the 8,170 DBpedia URIs provided in the competition, we extracted the
abstract of each book with SPARQL to obtain a matrix of size $8{,}170 \times V$,
where $V$ is the number of words in the vocabulary. After stop-word removal and
removing words with low frequency, we ended up with $V \approx 1{,}500$. We then
trained 99 RNNs with hidden layers ranging from 2 to 100 nodes to cover a
broad range in the number of latent features $K$ used to describe the content in the
DBbook dataset. Similarly, we set the same number of topics for LDA, LSA, and
PCA in the experiments (Figure 3 (a)). Initially we had considered gradient
descent to estimate the regression weights during user preference learning, but
this approach (with 1, 2, 3, 5, 10, 20, and 100 iterations) did not provide lower
Root Mean Squared Error (RMSE) values than using the normal equation for
linear regression (Figure 3 (b)), so we used the normal equation as the best
(and fastest) approach. When gradient descent was considered, we evaluated
100 values for the regularization parameter in the range of $10^{-6}$ to $10^{3}$, so
in total 9,900 deep learning-based recommenders were implemented.
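As an illustration of the abstract-extraction step, a hypothetical snippet using the SPARQLWrapper library to fetch the English dbo:abstract of a single DBpedia resource (endpoint, query, and function name are assumptions, not the challenge code):

```python
from SPARQLWrapper import SPARQLWrapper, JSON

def fetch_abstract(dbpedia_uri):
    """Query the public DBpedia endpoint for the English abstract of a resource."""
    sparql = SPARQLWrapper("http://dbpedia.org/sparql")
    sparql.setQuery(f"""
        PREFIX dbo: <http://dbpedia.org/ontology/>
        SELECT ?abstract WHERE {{
            <{dbpedia_uri}> dbo:abstract ?abstract .
            FILTER (lang(?abstract) = "en")
        }}
    """)
    sparql.setReturnFormat(JSON)
    bindings = sparql.query().convert()["results"]["bindings"]
    return bindings[0]["abstract"]["value"] if bindings else None

# example usage on a single book URI (illustrative resource)
print(fetch_abstract("http://dbpedia.org/resource/The_Hobbit"))
```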
[Figure 3: panels (a) and (b)]
5 Lessons Learned
During the implementation of the recommender system, we came up with the
following findings:
- Modeling word contexts provides a semantic relationship between words that
  improves the latent representation of documents.
- When the matrix $V$ contains enough information to describe the structure in
  the content of documents, updating $U$ and $V$ with coordinate ascent provides
  only minor improvements. Thus, we can train $U$ with a very small number of
  iterations if enough effort has been devoted to training $V$, or both steps can be
  performed independently without loss of RMSE.
- The above approach requires a more aggressive regularization parameter to
  control overfitting in the matrix $U$. Empirically, large values of the regular-
  ization parameter provide the smallest RMSE in this setting.
- Surprisingly, projecting the LDA topics onto an orthogonal space with PCA no-
  tably improved the prediction results. We believe this is because the linear
  combination of features and weights provided by the regression algorithms
  is better suited to a space with low correlation between latent variables, as
  sketched below.
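A hypothetical scikit-learn sketch of this last finding; the synthetic data, component counts, and the Ridge regressor are illustrative assumptions rather than the challenge configuration:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation, PCA
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# synthetic bag-of-words counts standing in for the document-term matrix
doc_term = rng.poisson(0.3, size=(200, 1500))

# per-document topic proportions from LDA (typically correlated features)
lda = LatentDirichletAllocation(n_components=50, random_state=0)
topics = lda.fit_transform(doc_term)

# project the topic proportions onto an orthogonal (decorrelated) basis
topics_pca = PCA(n_components=50).fit_transform(topics)

# the per-user regression is then fit on the decorrelated features
# (y stands in for one user's preference values over the documents)
y = rng.normal(size=200)
model = Ridge(alpha=1.0).fit(topics_pca, y)
```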
6 Conclusions
We presented a recommender system that uses the semantic word representations
learned by a type of deep learning algorithm called a Recurrent Neural Network
(RNN). This method provides a lower-dimensional and less sparse representation of
the content of text documents. Our results and submission to the RecSys Challenge
show that an RNN provides lower RMSE values than latent representations based on
LDA, PCA, and LSA.
References
1. Robert M. Bell and Yehuda Koren. Lessons from the Netflix Prize challenge.
   SIGKDD Explor. Newsl., 9(2):75–79, December 2007.
2. Andrew L. Maas and Andrew Y. Ng. A probabilistic model for semantic word vectors.
   In NIPS, volume 10, 2010.
3. Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of
   word representations in vector space. Computing Research Repository, 2013.
4. Tomas Mikolov et al. Distributed representations of words and phrases and
   their compositionality. In NIPS, 2013.
5. Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in con-
   tinuous space word representations. In Conference of the North American Chapter
   of the Association for Computational Linguistics, pages 746–751, 2013.