
Application of Dimensionality Reduction in Recommender System -- A Case Study

Badrul M. Sarwar, George Karypis, Joseph A. Konstan, John T. Riedl


GroupLens Research Group / Army HPC Research Center
Department of Computer Science and Engineering
University of Minnesota
Minneapolis, MN 55455
+1 612 625-4002
{sarwar, karypis, konstan, riedl}@cs.umn.edu

Abstract

We investigate the use of dimensionality reduction to improve performance for a new class of data analysis software called "recommender systems". Recommender systems apply knowledge discovery techniques to the problem of making product recommendations during a live customer interaction, and they are achieving widespread success in E-Commerce. The tremendous growth of customers and products poses three key challenges for recommender systems in the E-commerce domain: producing high quality recommendations, performing many recommendations per second for millions of customers and products, and achieving high coverage in the face of data sparsity. One successful recommender system technology is collaborative filtering, which works by matching customer preferences to those of other customers when making recommendations. Collaborative filtering has been shown to produce high quality recommendations, but its performance degrades with the number of customers and products. New recommender system technologies are needed that can quickly produce high quality recommendations, even for very large-scale problems.

This paper presents two different experiments in which we explored one technology, Singular Value Decomposition (SVD), to reduce the dimensionality of recommender system databases. Each experiment compares the quality of a recommender system using SVD with the quality of a recommender system using collaborative filtering. The first experiment compares the effectiveness of the two recommender systems at predicting consumer preferences based on a database of explicit ratings of products. The second experiment compares the effectiveness of the two recommender systems at producing Top-N lists based on a real-life customer purchase database from an E-Commerce site. Our experience suggests that SVD has the potential to meet many of the challenges of recommender systems, under certain conditions.

1 Introduction

Recommender systems have evolved in the extremely interactive environment of the Web. They apply data analysis techniques to the problem of helping customers find the products they would like to purchase at E-Commerce sites. For instance, a recommender system on Amazon.com (www.amazon.com) suggests books to customers based on other books the customers have told Amazon they like. Another recommender system on CDnow (www.cdnow.com) helps customers choose CDs to purchase as gifts, based on other CDs the recipient has liked in the past. In a sense, recommender systems are an application of a particular type of Knowledge Discovery in Databases (KDD) (Fayyad et al. 1996) technique. KDD systems use many subtle data analysis techniques to achieve two unsubtle goals: i) to save money by discovering the potential for efficiencies, and ii) to make more money by discovering ways to sell more products to customers. For instance, companies are using KDD to discover which products sell well at which times of year, so they can manage their retail store inventory more efficiently, potentially saving millions of dollars a year (Brachman et al. 1996). Other companies are using KDD to discover which customers will be most interested in a special offer, reducing the costs of direct mail or outbound telephone campaigns by hundreds of thousands of dollars a year (Bhattacharyya 1998, Ling et al. 1998). These applications typically involve using KDD to discover a new model, and having an analyst apply the model to the application. However, the most direct benefit of KDD to businesses is increasing sales of existing products by matching customers to the products they will be most likely to purchase. The Web presents new opportunities for KDD, but challenges KDD systems to perform interactively. While a customer is at the E-Commerce site, the recommender system must learn from the customer's behavior, develop a model of that behavior, and apply that model to recommend products to the customer. Recommender systems directly realize this benefit of KDD systems in E-Commerce: they help consumers find the products they wish to buy at the E-Commerce site. Collaborative filtering is the most successful recommender system technology to date, and is used in many of the most successful recommender systems on the Web, including those at Amazon.com and CDnow.com.

The earliest implementations of collaborative filtering, in systems such as Tapestry (Goldberg et al., 1992), relied on the opinions of people from a close-knit community, such as an office workgroup. However, collaborative filtering for large communities cannot depend on each person knowing the others. Several systems use statistical techniques to provide personal recommendations of documents by finding a group of other users, known as neighbors, that have a history of agreeing with the target user. Usually, neighborhoods are formed by applying proximity measures such as the Pearson correlation between the opinions of the users. These are called nearest-neighbor techniques. Figure 1 depicts neighborhood formation using a nearest-neighbor technique in a very simple two-dimensional space. Notice that each user's neighborhood consists of those other users who are most similar to him, as identified by the proximity measure. Neighborhoods need not be symmetric: each user has the best neighborhood for him. Once a neighborhood of users is found, particular products can be evaluated by forming a weighted composite of the neighbors' opinions of that product.

Figure 1: Illustration of the neighborhood formation process. The distance between the target user and every other user is computed and the closest-k users are chosen as the neighbors (for this diagram k = 5).
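To make the closest-k idea concrete, here is a minimal Python sketch of Pearson-based neighborhood formation over a user-item ratings matrix that uses NaN for unrated items. The function names and the handling of users with fewer than two co-rated items are our own illustrative choices, not the actual implementation of any of the systems described here.

```python
# Illustrative sketch only: closest-k neighborhood formation by Pearson proximity.
import numpy as np

def pearson_on_corated(u, v):
    """Pearson correlation over only the items both users rated (NaN = unrated)."""
    mask = ~np.isnan(u) & ~np.isnan(v)
    if mask.sum() < 2:                 # correlation undefined with < 2 co-rated items
        return 0.0
    a, b = u[mask] - u[mask].mean(), v[mask] - v[mask].mean()
    denom = np.sqrt((a ** 2).sum() * (b ** 2).sum())
    return float(a @ b / denom) if denom > 0 else 0.0

def closest_k_neighbors(R, target, k=5):
    """Indices of the k users most similar to `target` (rows of R are users)."""
    sims = [(pearson_on_corated(R[target], R[j]), j)
            for j in range(R.shape[0]) if j != target]
    sims.sort(reverse=True)            # highest correlation first
    return [j for _, j in sims[:k]]
```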

These statistical approaches, known as automated collaborative filtering, typically rely upon ratings as numerical expressions of user preference. Several ratings-based automated collaborative filtering systems have been developed. The GroupLens Research system (Resnick et al. 1994) provides a pseudonymous collaborative filtering solution for Usenet news and movies. Ringo (Shardanand et al. 1995) and Video Recommender (Hill et al. 1995) are email- and web-based systems that generate recommendations for music and movies, respectively. Figure 2 presents a schematic diagram of the architecture of the GroupLens Research collaborative filtering engine. The user interacts with a Web interface. The Web server software communicates with the recommender system to choose products to suggest to the user. The recommender system, in this case a collaborative filtering system, uses its database of ratings of products to form neighborhoods and make recommendations. The Web server software displays the recommended products to the user.

[Figure 2 diagram: the customer exchanges requests and responses with a WWW server and dynamic HTML generator; ratings flow to the recommender engine and a ratings database, and recommendations flow back, with the engine drawing on correlation and ratings databases.]

Figure 2. Recommender System Architecture.

The largest Web sites operate at a scale that stresses the direct implementation of collaborative filtering. Model-based techniques (Fayyad et al., 1996) have the potential to contribute to recommender systems that can operate at the scale of these sites. However, these techniques must be adapted to the real-time needs of the Web, and they must be tested in realistic problems derived from Web access patterns. The present paper describes our experimental results in applying a model-based technique, Latent Semantic Indexing (LSI), which uses a dimensionality reduction technique, Singular Value Decomposition (SVD), to our recommender system. We use two data sets in our experiments to test the performance of the model-based technique: a movie dataset and an e-commerce dataset.

The contributions of this paper are:

1. The details of how one model-based technology, LSI/SVD, was applied to reduce dimensionality in recommender systems for generating predictions.

2. The use of a low-dimensional representation to compute neighborhoods for generating recommendations.

3. The results of our experiments with LSI/SVD on two test data sets: our MovieLens test-bed and customer-product purchase data from a large E-commerce company, which has asked to remain anonymous.

The rest of the paper is organized as follows. The next section describes some potential problems associated with correlation-based collaborative filtering models. Section 3 explores the possibilities of leveraging the latent semantic relationships in the customer-product matrix as a basis for prediction generation; at the same time, it explains how we can take advantage of the reduced dimensionality to form better neighborhoods of customers. The section following that delineates our experimental test-bed, experimental design, and results, with a discussion of the improvement in quality and performance. Section 5 concludes the paper and provides directions for future research.

2 Existing Recommender Systems Approaches and their Limitations

Most collaborative filtering based recommender systems build a neighborhood of like-minded customers. The neighborhood formation scheme usually uses Pearson correlation or cosine similarity as a measure of proximity (Shardanand et al. 1995, Resnick et al. 1994). Once these systems determine the proximity neighborhood, they produce two types of recommendations:

1. Prediction of how much a customer C will like a product P. In the case of a correlation-based algorithm, the prediction on product P for customer C is computed by taking a weighted sum of co-rated items between C and all his neighbors and then adding C's average rating to that. This can be expressed by the following formula (Resnick et al., 1994):

   CP_pred = C̄ + ( Σ_{J ∈ raters} (J_P − J̄) · r_CJ ) / ( Σ_J |r_CJ| )

   Here, r_CJ denotes the correlation between user C and neighbor J, J_P is J's rating on product P, and J̄ and C̄ are J's and C's average ratings. The prediction is personalized for the customer C. There are, however, some naive non-personalized prediction schemes where the prediction, for example, is computed simply by taking the average rating of the item being predicted over all users (Herlocker et al., 1999). (A small illustrative sketch of this weighted-sum scheme follows this list.)

2. Recommendation of a list of products for a customer C. This is commonly known as top-N recommendation. Once a neighborhood is formed, the recommender system algorithm focuses on the products rated by the neighbors and selects a list of N products that will be liked by the customer.
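Below is a small sketch of the weighted-sum prediction formula above. It assumes the neighborhood and its correlations have already been computed (for instance, as in the earlier neighborhood sketch); all names are illustrative rather than part of any published implementation.

```python
# Illustrative sketch of the Resnick et al. (1994) weighted-sum prediction.
# R holds ratings with NaN for unrated items; sims[i] is r_CJ for neighbors[i].
import numpy as np

def predict(R, c, p, neighbors, sims):
    """Predict user c's rating on product p from neighbors' mean-offset ratings."""
    c_avg = np.nanmean(R[c])
    num, den = 0.0, 0.0
    for j, r_cj in zip(neighbors, sims):
        if np.isnan(R[j, p]):          # neighbor j never rated product p
            continue
        num += (R[j, p] - np.nanmean(R[j])) * r_cj
        den += abs(r_cj)
    return c_avg if den == 0 else c_avg + num / den
```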
These systems have been successful in several domains, but the algorithms are reported to show some limitations, such as:

• Sparsity: Nearest neighbor algorithms rely upon exact matches, which causes the algorithms to sacrifice recommender system coverage and accuracy (Konstan et al., 1997; Sarwar et al., 1998). In particular, since the correlation coefficient is only defined between customers who have rated at least two products in common, many pairs of customers have no correlation at all (Billsus et al., 1998). In practice, many commercial recommender systems are used to evaluate large product sets (e.g., Amazon.com recommends books and CDnow recommends music albums). In these systems, even active customers may have rated well under 1% of the products (1% of 2 million books is 20,000 books, a large set on which to have an opinion). Accordingly, Pearson nearest neighbor algorithms may be unable to make many product recommendations for a particular user. This problem is known as reduced coverage, and is due to sparse ratings of neighbors. Furthermore, the accuracy of recommendations may be poor because fairly little ratings data can be included. An example of a missed opportunity for quality is the loss of neighbor transitivity: if customers Paul and Sue correlate highly, and Sue also correlates highly with Mike, it is not necessarily true that Paul and Mike will correlate. They may have too few ratings in common or may even show a negative correlation due to a small number of unusual ratings in common.

• Scalability: Nearest neighbor algorithms require computation that grows with both the number of customers and the number of products. With millions of customers and products, a typical web-based recommender system running existing algorithms will suffer serious scalability problems.

• Synonymy: In real-life scenarios, different product names can refer to similar objects. Correlation-based recommender systems cannot find this latent association and treat these products differently. For example, consider two customers, one of whom rates 10 different recycled letter pad products as "high" and another who rates 10 different recycled memo pad products as "high". Correlation-based recommender systems would see no match between the two product sets from which to compute a correlation, and would be unable to discover the latent association that both customers like recycled office products.

3 Applying SVD for Collaborative Filtering

The weakness of Pearson nearest neighbor for large, sparse databases led us to explore alternative recommender system algorithms. Our first approach attempted to bridge the sparsity by incorporating semi-intelligent filtering agents into the system (Sarwar et al., 1998; Good et al., 1999). These agents evaluated and rated each product, using syntactic features. By providing a dense ratings set, they helped alleviate coverage problems and improved quality. The filtering agent solution, however, did not address the fundamental problem of poor relationships among like-minded but sparse-rating customers. We recognized that the KDD research community had extensive experience learning from sparse databases. After reviewing several KDD techniques, we decided to try applying Latent Semantic Indexing (LSI) to reduce the dimensionality of our customer-product ratings matrix.

LSI is a dimensionality reduction technique that has been widely used in information retrieval (IR) to solve the problems of synonymy and polysemy (Deerwester et al. 1990). Given a term-document-frequency matrix, LSI is used to construct two matrices of reduced dimensionality. In essence, these matrices represent latent attributes of terms, as reflected by their occurrence in documents, and of documents, as reflected by the terms that occur within them. We are trying to capture the relationships among pairs of customers based on ratings of products. By reducing the dimensionality of the product space, we can increase density and thereby find more ratings. Discovery of latent relationships from the database may potentially solve the synonymy problem in recommender systems.

LSI, which uses singular value decomposition as its underlying matrix factorization algorithm, maps nicely onto the collaborative filtering recommender algorithm challenge. Berry et al. (1995) point out that the reduced orthogonal dimensions resulting from SVD are less noisy than the original data and capture the latent associations between the terms and documents. Earlier work (Billsus et al. 1998) took advantage of this semantic property to reduce the dimensionality of the feature space; the reduced feature space was used to train a neural network to generate predictions. The rest of this section presents the construction of the SVD-based recommender algorithm for the purpose of generating predictions and top-N recommendations; the following section describes our experimental setup, evaluation metrics, and results.
3.1 Singular Value Decomposition (SVD)

SVD is a well-known matrix factorization technique that factors an m × n matrix R into three matrices as follows:

   R = U · S · V′

where U and V are two orthogonal matrices of size m × r and n × r respectively, and r is the rank of the matrix R. S is a diagonal matrix of size r × r having all the singular values of matrix R as its diagonal entries. All the entries of matrix S are positive and stored in decreasing order of their magnitude. The matrices obtained by performing SVD are particularly useful for our application because of the property that SVD provides the best lower-rank approximations of the original matrix R, in terms of the Frobenius norm. It is possible to reduce the r × r matrix S to have only the k largest diagonal values, obtaining a matrix Sk, k < r. If the matrices U and V are reduced accordingly, then the reconstructed matrix Rk = Uk · Sk · Vk′ is the closest rank-k matrix to R. In other words, Rk minimizes the Frobenius norm ||R − Rk|| over all rank-k matrices.
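As an illustration, the rank-k truncation can be written in a few lines of numpy; this is a sketch of the mathematical property above, not the authors' code. numpy.linalg.svd returns the singular values already sorted in decreasing order, so the truncation is a slice.

```python
# Minimal sketch: best rank-k approximation of R in the Frobenius norm.
import numpy as np

def rank_k_approximation(R, k):
    """Return U_k, S_k (diagonal), Vt_k and the reconstructed rank-k matrix R_k."""
    U, s, Vt = np.linalg.svd(R, full_matrices=False)  # s sorted in decreasing order
    Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]
    Rk = Uk @ np.diag(sk) @ Vtk    # closest rank-k matrix to R (Frobenius norm)
    return Uk, np.diag(sk), Vtk, Rk
```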
We use SVD in recommender systems to perform two different tasks. First, we use SVD to capture latent relationships between customers and products, which allows us to compute the predicted likeliness of a certain product by a customer. Second, we use SVD to produce a low-dimensional representation of the original customer-product space and then compute the neighborhood in the reduced space; we then use that neighborhood to generate a list of top-N product recommendations for customers. The following is a description of our experiments.

3.1.1 Prediction Generation

We start with a customer-product ratings matrix that is very sparse; we call this matrix R. To capture meaningful latent relationships we first removed the sparsity by filling in our customer-product ratings matrix. We tried two different approaches, using the average ratings for a customer and using the average ratings for a product, and found that the product average produced better results. We also considered two normalization techniques, conversion of ratings to z-scores and subtraction of the customer average from each rating, and found the latter approach to provide better results. After normalization we obtain a filled, normalized matrix Rnorm. Essentially, Rnorm = R + NPR, where NPR is the fill-in matrix that provides a naive non-personalized recommendation. We factor the matrix Rnorm and obtain a low-rank approximation by applying the following steps described in (Deerwester et al. 1990):

• factor Rnorm using SVD to obtain U, S and V;
• reduce the matrix S to dimension k;
• compute the square root of the reduced matrix Sk, to obtain Sk^1/2;
• compute two resultant matrices: Uk·Sk^1/2 and Sk^1/2·Vk′.

These resultant matrices can now be used to compute the recommendation score for any customer c and product p. We observe that the dimension of Uk·Sk^1/2 is m × k and the dimension of Sk^1/2·Vk′ is k × n. To compute the prediction we simply calculate the dot product of the cth row of Uk·Sk^1/2 and the pth column of Sk^1/2·Vk′, and add the customer average back, using the following:

   CP_pred = C̄ + Uk·Sk^1/2(c) · Sk^1/2·Vk′(p)

Note that even though the Rnorm matrix is dense, the special structure of the matrix NPR allows us to use sparse SVD algorithms (e.g., Lanczos) whose complexity is almost linear in the number of non-zeros in the original matrix R.
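The following sketch strings these steps together in numpy. The function name svd_predict_matrix is hypothetical, and the dense SVD call stands in for the sparse Lanczos-style algorithms the text mentions; the authors' own experiments used MATLAB (see Section 4.3).

```python
# Sketch of the prediction pipeline above: fill, normalize, factor, predict.
import numpy as np

def svd_predict_matrix(R, k):
    """R: ratings with NaN for missing entries. Returns a full prediction matrix."""
    col_avg = np.nanmean(R, axis=0)                 # product averages (fill-in values)
    filled = np.where(np.isnan(R), col_avg, R)
    row_avg = filled.mean(axis=1, keepdims=True)    # customer averages
    Rnorm = filled - row_avg                        # subtract customer average
    U, s, Vt = np.linalg.svd(Rnorm, full_matrices=False)
    sqrt_Sk = np.diag(np.sqrt(s[:k]))
    left = U[:, :k] @ sqrt_Sk                       # Uk·Sk^1/2, one row per customer
    right = sqrt_Sk @ Vt[:k, :]                     # Sk^1/2·Vk', one column per product
    return left @ right + row_avg                   # add the customer average back
```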
3.1.2 Recommendation Generation

In our second experiment, we look into the prospects of using the low-dimensional space as a basis for neighborhood formation, and of using the neighbors' opinions on products they purchased to recommend a list of N products for a given customer. For this purpose we consider the customer preference data as binary, by treating each non-zero entry of the customer-product matrix as "1". This means that we are only interested in whether a customer consumed a particular product, not in how much he/she liked it.

Neighborhood formation in the reduced space: The fact that the reduced-dimensional representation of the original space is less sparse than its high-dimensional counterpart led us to form the neighborhood in that space. As before, we started with the original customer-product matrix A, and then used SVD to produce the three decomposed matrices U, S, and V. We then reduced S by retaining only k singular values and obtained Sk. Accordingly, we performed dimensionality reduction to obtain Uk and Vk. As in the previous method, we finally computed the matrix product Uk·Sk^1/2. This m × k matrix is the k-dimensional representation of the m customers. We then used vector similarity (cosine similarity) to form the neighborhood in that reduced space.

Top-N recommendation generation: Once the neighborhood is formed we concentrate on the neighbors of a given customer and analyze the products they purchased to recommend the N products the target customer is most likely to purchase. After computing the neighborhood for a given customer C, we scan through the purchase record of each of the k neighbors and perform a frequency count on the products they purchased. The product list is then sorted and the N most frequently purchased items are returned as recommendations for the target customer. We call this scheme most frequent item recommendation.
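A compact sketch of both steps, reduced-space neighborhood formation followed by most frequent item recommendation, might look as follows; the parameter names and defaults are illustrative only.

```python
# Sketch: cosine neighborhoods in the Uk·Sk^1/2 space, then most-frequent-item top-N.
import numpy as np

def top_n(B, c, k_dims=20, k_neighbors=10, n=10):
    """B: binary customer-product matrix; c: target customer index."""
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    users_k = U[:, :k_dims] @ np.diag(np.sqrt(s[:k_dims]))  # Uk·Sk^1/2
    norms = np.linalg.norm(users_k, axis=1) + 1e-12
    sims = (users_k @ users_k[c]) / (norms * norms[c])      # cosine similarity
    sims[c] = -np.inf                                       # exclude the target user
    neighbors = np.argsort(sims)[::-1][:k_neighbors]
    counts = B[neighbors].sum(axis=0).astype(float)         # purchase frequency count
    counts[B[c] > 0] = -1.0                                 # skip already-purchased items
    return np.argsort(counts)[::-1][:n]                     # n most frequent products
```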
3.1.3 Sensitivity of the Number of Dimensions k

The optimal choice of the value k is critical to high-quality prediction generation. We are interested in a value of k that is large enough to capture all the important structures in the matrix yet small enough to avoid overfitting errors. We experimentally find a good value of k by trying several different values.
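A simple way to carry out such a search is to hold out some known ratings and keep the k with the lowest error, as in this hypothetical sketch; it reuses the svd_predict_matrix sketch from Section 3.1.1 and the MAE metric discussed in Section 4.2, and the candidate list mirrors the values tried in Section 4.3.

```python
# Hypothetical sketch: pick k by held-out mean absolute error (MAE).
import numpy as np

def choose_k(R_train, held_out, candidates=(2, 5, 10, 14, 20, 25, 50, 100)):
    """held_out: list of (user, item, true_rating) triplets absent from R_train."""
    best_k, best_mae = None, np.inf
    for k in candidates:
        P = svd_predict_matrix(R_train, k)      # from the Section 3.1.1 sketch
        mae = np.mean([abs(P[u, i] - r) for u, i, r in held_out])
        if mae < best_mae:
            best_k, best_mae = k, mae
    return best_k, best_mae
```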
3.1.4 Performance Implications

In practice, e-commerce sites like Amazon.com experience a tremendous number of customer visits per day. Recommending products to such a large number of customers in real time requires the underlying recommendation engine to be highly scalable. Recommendation algorithms usually divide the prediction generation algorithm into two parts: an offline component and an online component. The offline component is the portion of the algorithm that requires an enormous amount of computation, e.g., the computation of customer-customer correlations in the case of a correlation-based algorithm. The online component is the portion of the algorithm that is dynamically computed to provide predictions to customers, using data from the stored offline component. In the case of SVD-based recommendation generation, the decomposition of the customer-product matrix R and the computation of the reduced user and item matrices, i.e., Uk·Sk^1/2 and Sk^1/2·Vk′, can be done offline.

Offline computation is not very critical to the performance of the recommender system, but there are some issues with the memory and secondary storage requirements that need to be addressed. In the case of SVD, the offline component requires more time compared to the correlation-based algorithm: for an m × n matrix the SVD decomposition requires time on the order of O((m+n)^3) (Deerwester et al., 1990), while computation of the correlations takes O(m^2·n). In terms of storage, however, SVD is more efficient; we need to store just the two reduced customer and product matrices, of size m × k and k × n respectively, for a total of O(m+n), since k is constant. In the case of the correlation-based CF algorithm, by contrast, an m × m all-to-all correlation table must be stored, requiring O(m^2) storage, which can be substantially large with millions of customers and products.

So, we observe that as a result of dimensionality reduction, SVD-based online performance is much better than that of correlation-based algorithms. For the same reason, neighborhood formation is also much faster when done in the low-dimensional space.

4 Experiments

4.1 Experimental Platform

4.1.1 Data sets

As mentioned before, we report two different experiments. In the first experiment we used data from our MovieLens recommender system to evaluate the effectiveness of our SVD-based prediction generation algorithm. MovieLens (www.movielens.umn.edu) is a web-based research recommender system that debuted in Fall 1997. Each week hundreds of users visit MovieLens to rate and receive recommendations for movies. The site now has over 35,000 users who have expressed opinions on 3,000+ different movies.[1] We randomly selected enough users to obtain 100,000 rating-records from the database (we only considered users that had rated twenty or more movies). A rating-record in this context is defined to be a triplet <customer, product, rating>. We divided the rating-records into a training set and a test set according to different ratios; we call this the training ratio and denote it by x. A value of x = 0.3 indicates that we divide the 100,000-rating data set into 30,000 train cases and 70,000 test cases. The training data was converted into a user-movie matrix R that had 943 rows (i.e., 943 users) and 1,682 columns (i.e., 1,682 movies that were rated by at least one of the users). Each entry ri,j represented the rating (from 1 to 5) of the ith user on the jth movie.

[1] In addition to MovieLens' users, the system includes over two million ratings from more than 45,000 EachMovie users. The EachMovie data is based on a static collection made available for research by Digital Equipment Corporation's Systems Research Center.
The second experiment is designed to test the effectiveness of "neighborhood formed in low dimensional space". In addition to the above movie data, we used historical catalog purchase data from a large e-commerce company. This data set contains purchase information for 6,502 users on 23,554 catalog items; in total, it contains 97,045 purchase records. In the case of the commerce data set, each record is a triplet <customer, product, purchase amount>. Since the purchase amount can't be meaningfully converted to a user rating, we didn't use the second data set for the prediction experiment. We converted all purchase amounts to "1" to make the data set binary and then used it for the recommendation experiment. As before, we divided the data set into a train set and a test set by using a similar notion of training ratio, x.

4.1.2 Benchmark recommender systems

To compare the performance of SVD-based prediction, we also entered the training ratings set into a collaborative filtering recommendation engine that employs the Pearson nearest neighbor algorithm. For this purpose we implemented CF-Predict, a flexible recommendation engine that implements collaborative filtering algorithms using C. We tuned CF-Predict to use the best published Pearson nearest neighbor algorithm and configured it to deliver the highest quality prediction without concern for performance (i.e., it considered every possible neighbor to form optimal neighborhoods). To compare the quality of SVD neighborhood-based recommendations, we implemented another recommender system that uses cosine similarity in the high-dimensional space to form neighborhoods and returns top-N recommendations; we call it CF-Recommend. We used the cosine measure for building neighborhoods in both cases because in the low-dimensional space proximity is measured only by computing the cosine.

For each of the ratings in the test data set, we requested a prediction from CF-Predict, computed the same prediction from the matrices Uk·Sk^1/2 and Sk^1/2·Vk′, and compared the two. Similarly, we compared the two top-N recommendation algorithms.
4.2 Evaluation Metrics

Recommender systems research has used several types of measures for evaluating the success of a recommender system. We consider only the quality of prediction or recommendation, as we are only interested in the output of a recommender system for evaluation purposes. It is, however, possible to evaluate intermediate steps (e.g., the quality of neighborhood formation). Here we discuss two types of metrics, for evaluating predictions and top-N recommendations respectively.

4.2.1 Prediction evaluation metrics

To evaluate an individual item prediction, researchers use the following metrics:

• Coverage metrics evaluate the number of products for which the system could provide recommendations. Overall coverage is computed as the percentage of customer-product pairs for which a recommendation can be made.

• Statistical accuracy metrics evaluate the accuracy of a system by comparing the numerical recommendation scores against the actual customer ratings for the customer-product pairs in the test dataset. Mean Absolute Error (MAE), Root Mean Squared Error (RMSE) and correlation between ratings and predictions are widely used metrics. Our experience has shown that these metrics typically track each other closely.

• Decision support accuracy metrics evaluate how effective a prediction engine is at helping a user select high-quality products from the set of all products. These metrics treat the prediction process as a binary operation: products are either predicted (good) or not (bad). With this view, whether a product has a prediction score of 1.5 or 2.5 on a five-point scale is irrelevant if the customer only chooses to consider predictions of 4 or higher. The most commonly used decision support accuracy metrics are reversal rate, weighted errors and ROC sensitivity (Le et al., 1995).

We used MAE as our evaluation metric for reporting the prediction experiments because it is the most commonly used metric and the easiest to interpret directly. In our previous experiments (Sarwar et al., 1999) we have seen that MAE and ROC provide the same ordering of different experimental schemes in terms of prediction quality.

4.2.2 Recommendation evaluation metrics

To evaluate top-N recommendation we use two metrics widely used in the information retrieval (IR) community, namely recall and precision. However, we slightly modify the definitions of recall and precision, as our experiment differs from standard IR. We divide the products into two sets: the test set and the top-N set. Products that appear in both sets are members of the hit set. We now define recall and precision as follows:

• Recall in the context of the recommender system is defined as:

   Recall = size of hit set / size of test set = |test ∩ topN| / |test|

• Precision is defined as:

   Precision = size of hit set / size of top-N set = |test ∩ topN| / N

These two measures are, however, often conflicting in nature. For instance, increasing the number N tends to increase recall but decrease precision. The fact that both are critical for the quality judgement leads us to use a combination of the two. In particular, we use the standard F1 metric (Yang et al., 1999), which gives equal weight to both and is computed as follows:

   F1 = (2 × Recall × Precision) / (Recall + Precision)

We compute F1 for each individual customer and calculate the average value to use as our metric.
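These three definitions translate directly into code; the sketch below operates on Python sets of product ids and is only an illustration of the formulas above.

```python
# Direct transcription of the recall, precision and F1 definitions above.
def recall_precision_f1(test_set, top_n_set):
    hits = len(test_set & top_n_set)            # products in both sets (the "hit set")
    recall = hits / len(test_set) if test_set else 0.0
    precision = hits / len(top_n_set) if top_n_set else 0.0
    f1 = (2 * recall * precision / (recall + precision)
          if recall + precision > 0 else 0.0)
    return recall, precision, f1
```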
4.3 Experimental Steps

4.3.1 Prediction experiment

Each entry in our data matrix R represents a rating on a 1-5 scale, except that in cases where user i didn't rate movie j the entry ri,j is null. We performed the following experimental steps.

We computed the average ratings for each user and for each movie, and filled the null entries in the matrix by replacing each null entry with the column average for the corresponding column. We then normalized all entries in the matrix by replacing each entry ri,j with (ri,j − r̄i), where r̄i is the average of the ith row. MATLAB was then used to compute the SVD of the filled and normalized matrix R, producing the three SVD component matrices U, S and V′; S is the matrix that contains the singular values of matrix R sorted in decreasing order. Sk was computed from S by retaining only the k largest singular values and replacing the rest with 0. We computed the square root of the reduced matrix and computed the matrix products Uk·Sk^1/2 and Sk^1/2·Vk′ as described above. We then multiplied the matrices Uk·Sk^1/2 and Sk^1/2·Vk′, producing a 943 × 1682 matrix P. Since the inner product of a row from Uk·Sk^1/2 and a column from Sk^1/2·Vk′ gives a prediction score, each entry pi,j of this resultant matrix P holds the prediction score for the user-movie pair i,j. We then de-normalized the matrix entries by adding the user average back into each prediction score. Finally, we loaded the training set ratings into CF-Predict, requested prediction scores for each of the test set ratings, computed the MAE of the SVD and the CF-Predict prediction scores, and compared the two sets of results.

We repeated the entire process for k = 2, 5-21, 25, 50 and 100, and found 14 to be the optimal value (Figure 3(a)). We then fixed k at 14 and varied the train/test ratio x from 0.2 to 0.95 with an increment of 0.05; for each point we ran the experiment 10 times, each time choosing different training/test sets, and took the average to generate the plots. Note that the overall performance of the SVD-based prediction algorithm does significantly change for a wide range of values of k.

4.3.2 Top-N recommendation experiment

We started with the same matrix as in the previous experiment but converted the rating entries (i.e., the non-zero entries) to "1". We then produced top-10 product recommendations for each customer based on the following two schemes:

• High-dimensional neighborhood: In this scheme we built the customer neighborhood in the original customer-product space and used most frequent item recommendation to produce the top-10 product list. We then used our F1 metric to evaluate the quality.

• Low-dimensional neighborhood: We first reduced the dimensionality of the original space by applying SVD and then used the Uk·Sk^1/2 matrix (i.e., the representation of the customers in the k-dimensional space) to build the neighborhood. As before, we used most frequent item recommendation to produce the top-10 list and evaluated it using the F1 metric.

In this experiment our main focus was on the E-commerce data. We also report our findings when we apply this technique to our movie preference data.

[Figure 3: two line charts. (a) "SVD prediction quality variation with number of dimensions": mean absolute error versus the number of dimensions k, with curves for x = 0.2, 0.5 and 0.8. (b) "SVD as prediction generator" (k fixed at 14): MAE versus the train/test ratio x, comparing Pure-CF and SVD.]

Figure 3. (a) Determination of the optimum value of k. (b) SVD vs. CF-Predict prediction quality.
4.4 Results

4.4.1 Prediction experiment results

Figure 3(b) charts our results for the prediction experiment. The data sets were obtained from the same sample of 100,000 ratings by varying the sizes of the training and test data sets (recall that x is the ratio between the size of the training set and the size of the entire data set). Note that the different values of x were used to determine the sensitivity of the different schemes to the sparsity of the training set.

4.4.2 Top-N recommendation experiment results

For the recommendation experiment, we first determined the optimum x ratio for both of our data sets in the high-dimensional and low-dimensional cases. We first ran the high-dimensional experiment for different x ratios, and then performed the low-dimensional experiments for different x values at a fixed dimension (k), computing the F1 metric. Figure 4 shows our results; we observe that the optimum x values are 0.8 and 0.6 for the movie data and the E-commerce data respectively.

[Figure 4: two line charts of the F1 metric versus the train ratio x (0.2 to 0.9), each comparing the high-dimensional and low-dimensional schemes: (a) Movie data set (ML High-dim vs. ML Low-dim); (b) Commerce data set (EC High-dim vs. EC Low-dim).]

Figure 4. Determination of the optimum value of x: (a) for the Movie data, (b) for the Commerce data.

Once we obtained the best x value, we ran the high-dimensional experiment for that x and computed the F1 metric. Then we ran our low-dimensional experiments for that x ratio, but varied the number of dimensions, k. Our results are presented in Figures 5 and 6. We represent the corresponding high-dimensional results (i.e., results from CF-Recommend) in the charts by drawing lines at their corresponding values.

[Figure 5: the F1 metric versus dimension k (10 to 100) for the Movie data set (ML Low-dim), with the ML High-dim value at x = 0.8 shown as a reference line.]

Figure 5. Top-10 recommendation results for the MovieLens data set.


4.5 Discussion

In the case of the prediction experiment, we observe in Figure 3(b) that for x < 0.5 the SVD-based prediction is better than the CF-Predict predictions; for x > 0.5, however, the CF-Predict predictions are slightly better. This suggests that nearest-neighbor-based collaborative filtering algorithms are susceptible to data sparsity, as the neighborhood formation process is hindered by the lack of enough training data, while SVD-based prediction algorithms can overcome the sparsity problem by utilizing the latent relationships. However, as the training data is increased, both SVD and CF-Predict prediction quality improve, and the improvement in the case of CF-Predict surpasses the SVD improvement.

From the plots of the recommender results (Figures 5 and 6), we observe that for the movie data the best result occurs in the vicinity of k = 20, while for the e-commerce data the recommendation quality keeps growing with increasing dimensions. The movie experiment reveals that the low-dimensional results are better than the high-dimensional counterpart at all values of k. In the e-commerce experiment the high-dimensional result is always better; as more and more dimensions are added, the low-dimensional values improve, but even when we increased the dimension up to 700 the low-dimensional values were still lower than the high-dimensional value, and beyond 700 the entire process becomes computationally very expensive. Since the commerce data is very high dimensional (6,502 × 23,554), such a small k value (up to 700) is probably not sufficient to provide a useful approximation of the original space. Another factor to consider is the amount of sparsity in the data sets: the movie data is 95.4% sparse (100,000 nonzero entries in a 943 × 1,682 matrix), while the e-commerce data is 99.996% sparse (97,045 nonzero entries in a 6,502 × 23,554 matrix). To test this hypothesis we deliberately increased the sparsity of our movie data (i.e., removed nonzero entries), repeated the experiment, and observed a dramatic reduction in F1 values.

[Figure 6: the F1 metric versus dimension k (50 to 700) for the Commerce data set (EC Low-dim), with the EC High-dim value at x = 0.6 shown as a reference line.]

Figure 6. Top-10 recommendation results for the E-Commerce data set.

Overall, the results are encouraging for the use of SVD in collaborative filtering recommender systems. The SVD algorithms fit well with the collaborative filtering data, and they result in good quality predictions. SVD also has the potential to provide better online performance than correlation-based systems. In the case of the top-10 recommendation experiment, we have seen that even with a small fraction of the dimensions, i.e., 20 out of 1,682 in the movie data, the SVD-based recommendation quality was better than that of the corresponding high-dimensional scheme. This indicates that neighborhoods formed in the reduced-dimensional space are better than their high-dimensional counterparts.[2]

[2] We are also working on experiments that use the reduced-dimensional neighborhood for prediction generation using the classical CF algorithm. So far, the results are encouraging.
5 Conclusions

Recommender systems are a powerful new technology for extracting additional value for a business from its customer databases. These systems help customers find products they want to buy from a business. Recommender systems benefit customers by enabling them to find products they like; conversely, they help the business by generating more sales. Recommender systems are rapidly becoming a crucial tool in E-commerce on the Web. They are being stressed by the huge volume of customer data in existing corporate databases, and will be stressed even more by the increasing volume of customer data available on the Web. New technologies are needed that can dramatically improve the scalability of recommender systems.

Our study shows that Singular Value Decomposition (SVD) may be such a technology in some cases. We tried several different approaches to using SVD for generating recommendations and predictions, and discovered one that can dramatically reduce the dimension of the ratings matrix from a collaborative filtering system. The SVD-based approach was consistently worse than traditional collaborative filtering in the case of an extremely sparse e-commerce dataset. However, the SVD-based approach produced results that were better than a traditional collaborative filtering algorithm some of the time in the denser MovieLens data set. This technique leads to very fast online performance, requiring just a few simple arithmetic operations for each recommendation. Computing the SVD is expensive, but can be done offline. Further research is needed to understand how often a new SVD must be computed, or whether the same quality can be achieved with incremental SVD algorithms (Berry et al., 1995).

Future work is required to understand exactly why SVD works well for some recommender applications, and less well for others. Also, there are many other ways in which SVD could be applied to recommender systems problems, including using SVD for neighborhood selection, or using SVD to create low-dimensional visualizations of the ratings space.

6 Acknowledgements

Funding for this research was provided in part by the National Science Foundation under grants IIS 9613960, IIS 9734442, and IIS 9978717, with additional funding by Net Perceptions Inc. This work was also supported by NSF CCR-9972519, by Army Research Office contract DA/DAAG55-98-1-0441, by the DOE ASCI program, and by Army High Performance Computing Research Center contract number DAAH04-95-C-0008. We thank the anonymous reviewers for their valuable comments.

References

1. Berry, M. W., Dumais, S. T., and O'Brien, G. W. 1995. "Using Linear Algebra for Intelligent Information Retrieval". SIAM Review, 37(4), pp. 573-595.

2. Billsus, D., and Pazzani, M. J. 1998. "Learning Collaborative Information Filters". In Proceedings of the Recommender Systems Workshop. Tech. Report WS-98-08, AAAI Press.

3. Bhattacharyya, S. 1998. "Direct Marketing Response Models using Genetic Algorithms". In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, pp. 144-148.

4. Brachman, R. J., Khabaza, T., Kloesgen, W., Piatetsky-Shapiro, G., and Simoudis, E. 1996. "Mining Business Databases". Communications of the ACM, 39(11), pp. 42-48, November.

5. Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. 1990. "Indexing by Latent Semantic Analysis". Journal of the American Society for Information Science, 41(6), pp. 391-407.

6. Fayyad, U. M., Piatetsky-Shapiro, G., Smyth, P., and Uthurusamy, R., Eds. 1996. "Advances in Knowledge Discovery and Data Mining". AAAI Press/MIT Press.

7. Goldberg, D., Nichols, D., Oki, B. M., and Terry, D. 1992. "Using Collaborative Filtering to Weave an Information Tapestry". Communications of the ACM, December.

8. Good, N., Schafer, B., Konstan, J., Borchers, A., Sarwar, B., Herlocker, J., and Riedl, J. 1999. "Combining Collaborative Filtering With Personal Agents for Better Recommendations". In Proceedings of the AAAI-99 Conference, pp. 439-446.

9. Heckerman, D. 1996. "Bayesian Networks for Knowledge Discovery". In Advances in Knowledge Discovery and Data Mining. Fayyad, U. M., Piatetsky-Shapiro, G., Smyth, P., and Uthurusamy, R., Eds. AAAI Press/MIT Press.

10. Herlocker, J., Konstan, J., Borchers, A., and Riedl, J. 1999. "An Algorithmic Framework for Performing Collaborative Filtering". In Proceedings of ACM SIGIR '99. ACM Press.

11. Hill, W., Stead, L., Rosenstein, M., and Furnas, G. 1995. "Recommending and Evaluating Choices in a Virtual Community of Use". In Proceedings of CHI '95.

12. Le, C. T., and Lindgren, B. R. 1995. "Construction and Comparison of Two Receiver Operating Characteristics Curves Derived from the Same Samples". Biometrical Journal, 37(7), pp. 869-877.

13. Ling, C. X., and Li, C. 1998. "Data Mining for Direct Marketing: Problems and Solutions". In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, pp. 73-79.

14. Resnick, P., Iacovou, N., Suchak, M., Bergstrom, P., and Riedl, J. 1994. "GroupLens: An Open Architecture for Collaborative Filtering of Netnews". In Proceedings of CSCW '94, Chapel Hill, NC.

15. Sarwar, B. M., Konstan, J. A., Borchers, A., Herlocker, J., Miller, B., and Riedl, J. 1998. "Using Filtering Agents to Improve Prediction Quality in the GroupLens Research Collaborative Filtering System". In Proceedings of CSCW '98, Seattle, WA.

16. Sarwar, B. M., Konstan, J. A., Borchers, A., and Riedl, J. 1999. "Applying Knowledge from KDD to Recommender Systems". Technical Report TR 99-013, Dept. of Computer Science, University of Minnesota.

17. Schafer, J. B., Konstan, J., and Riedl, J. 1999. "Recommender Systems in E-Commerce". In Proceedings of the ACM E-Commerce 1999 Conference.

18. Shardanand, U., and Maes, P. 1995. "Social Information Filtering: Algorithms for Automating 'Word of Mouth'". In Proceedings of CHI '95, Denver, CO.

19. Yang, Y., and Liu, X. 1999. "A Re-examination of Text Categorization Methods". In Proceedings of ACM SIGIR '99, pp. 42-49.

20. Zytkow, J. M. 1997. "Knowledge = Concepts: A Harmful Equation". In Proceedings of the Third International Conference on Knowledge Discovery and Data Mining.
