
Blind Regression: Nonparametric Regression for Latent Variable Models via Collaborative Filtering

Christina E. Lee    Yihua Li    Devavrat Shah    Dogyoon Song


Laboratory for Information and Decision Systems
Department of Electrical Engineering and Computer Science
Massachusetts Institute of Technology
{celee, liyihua, devavrat, dgsong}@mit.edu

Abstract
We introduce the framework of blind regression motivated by matrix completion
for recommendation systems: given m users, n movies, and a subset of user-movie
ratings, the goal is to predict the unobserved user-movie ratings given the data,
i.e., to complete the partially observed matrix. Following the framework of non-
parametric statistics, we posit that user u and movie i have features x1 (u) and
x2 (i) respectively, and their corresponding rating y(u, i) is a noisy measurement of
f (x1 (u), x2 (i)) for some unknown function f . In contrast with classical regression,
the features x = (x1 (u), x2 (i)) are not observed, making it challenging to apply
standard regression methods to predict the unobserved ratings.
Inspired by the classical Taylor’s expansion for differentiable functions, we pro-
vide a prediction algorithm that is consistent for all Lipschitz functions. In fact,
the analysis through our framework naturally leads to a variant of collaborative
filtering, shedding insight into the widespread success of collaborative filtering in
practice. Assuming each entry is sampled independently with probability at least
max(m^{−1+δ}, n^{−1/2+δ}) with δ > 0, we prove that the expected fraction of our
estimates with error greater than ε is less than γ²/ε² plus a polynomially decaying
term, where γ² is the variance of the additive entry-wise noise term.
Experiments with the MovieLens and Netflix datasets suggest that our algorithm
provides principled improvements over basic collaborative filtering and is competi-
tive with matrix factorization methods.

1 Introduction
In this paper, we provide a statistical framework for performing nonparametric regression over latent
variable models. We are initially motivated by the problem of matrix completion arising in the
context of designing recommendation systems. In the popularized setting of Netflix, there are m
users, indexed by u ∈ [m], and n movies, indexed by i ∈ [n]. Each user u has a rating for each
movie i, denoted as y(u, i). The system observes ratings for only a small fraction of user-movie
pairs. The goal is to predict ratings for the rest of the unknown user-movie pairs, i.e., to complete
the partially observed m × n rating matrix. To be able to obtain meaningful predictions from the
partially observed matrix, it is essential to impose a structure on the data.
We assume each user u and movie i is associated to features x1 (u) ∈ X1 and x2 (i) ∈ X2 for some
compact metric spaces X1 , X2 equipped with Borel probability measures. Following the philosophy
of non-parametric statistics, we assume that there exists some function f : X1 × X2 → R such that
the rating of user u for movie i is given by
y(u, i) = f (x1 (u), x2 (i)) + ηui , (1)

where ηui is some independent bounded noise. We observe ratings for a subset of the user-movie
pairs, and the goal is to use the given data to predict f (x1 (u), x2 (i)) for all (u, i) ∈ [m] × [n] whose
rating is unknown. In classical nonparametric regression, we observe input features x1 (u), x2 (i)
along with the rating y(u, i) for each datapoint, and thus we can approximate the function f well
using local approximation techniques as long as f satisfies mild regularity conditions. However, in
our setting, we do not observe the latent features x1 (u), x2 (i), but instead we only observe the indices
(u, i). Therefore, we use blind regression to refer to the challenge of performing regression with
unobserved latent input variables. This paper addresses the question: does there exist a meaningful
prediction algorithm for general nonparametric regression when the input features are unobserved?

Related Literature. Matrix completion has received enormous attention in the past decade. Matrix
factorization based approaches, such as low-rank approximation, and neighborhood based approaches,
such as collaborative filtering, have been the primary ways to address the problem. In the recent
years, there has been exciting intellectual development in the context of matrix factorization based
approaches. Since any matrix can be factorized, its entries can be described by a function f in (1) with
the form f(x1, x2) = x1ᵀ x2, and the goal of factorization is to recover the latent features for each row
and column. [25] was one of the earlier works to suggest the use of low-rank matrix approximation,
observing that a low-rank matrix has a comparatively small number of free parameters. Subsequently,
statistically efficient approaches were suggested using optimization based estimators, proving that
matrix factorization can fill in the missing entries with sample complexity as low as rn log n, where
r is the rank of the matrix [5, 23, 11, 21, 10]. There has been an exciting line of ongoing work to
make the resulting algorithms faster and scalable [7, 17, 4, 15, 24, 20].
Many of these approaches are based on the structural assumption that the underlying matrix is
low-rank and the matrix entries are reasonably “incoherent”. Unfortunately, the low-rank assumption
may not hold in practice. The recent work [8] makes precisely this observation, showing that a simple
non-linear, monotonic transformation of a low-rank matrix could easily produce an effectively high-
rank matrix, despite few free model parameters. They provide an algorithm and analysis specific to
the form of their model, which achieves sample complexity of O((mn)2/3 ). However, their algorithm
only applies to functions f which are a nonlinear monotonic transformation of the inner product of
the latent features. [6] proposes the universal singular value thresholding estimator (USVT), and
they provide an analysis under a similar model in which they assume f to be a bounded Lipschitz
function. They achieve a sample complexity, or the required fraction of measurements over the total
mn entries, which scales with the latent space dimension q according to Ω(m^{−2/(q+2)}) for a square
matrix, whereas we achieve a sample complexity of Ω(m^{−1/2+δ}) (which is independent of q) as long
as the latent dimension scales as o(log n).
The term collaborative filtering was coined in [9], and this technique is widely used in practice due to
its simplicity and ability to scale. There are two main paradigms in neighborhood-based collaborative
filtering: the user-user paradigm and the item-item paradigm. To recommend items to a user in the
user-user paradigm, one first looks for similar users, and then recommends items liked by those
similar users. In the item-item paradigm, in contrast, items similar to those liked by the user are
found and subsequently recommended. Much empirical evidence exists that the item-item paradigm
performs well in many cases [16, 14, 22]; however, the theoretical understanding of the method has
been limited. In recent works, latent mixture models or cluster models have been introduced to
explain the collaborative filtering algorithm as well as the empirically observed superior performance
of the item-item paradigm, cf. [12, 13, 1, 2, 3]. However, these results assume a specific parametric
model, such as a mixture distribution model for preferences across users and movies. We hope that
by providing an analysis for collaborative filtering within our broader nonparametric model, we can
provide a more complete understanding of the potentials and limitations of collaborative filtering.
The algorithm that we propose in this work is inspired by local functional approximations, specifi-
cally Taylor’s approximation and classical kernel regression, which also relies on local smoothed
approximations, cf. [18, 26]. However, since kernel regression and other similar methods use explicit
knowledge of the input features, their analysis and proof techniques do not extend to our context of
blind regression, in which the features are latent. Although our estimator takes a similar form of
computing a convex combination of nearby datapoints weighted according to a function of the latent
distance, the analysis required is entirely different.

Contributions. The key contribution of our work is in providing a statistical framework for nonpara-
metric regression over latent variable models. We refrain from any specific modeling assumptions
on f , keeping mild regularity conditions aligned with the philosophy of non-parametric statistics.
We assume that the latent features are drawn independently and identically distributed (i.i.d.) over
bounded metric spaces; the function f is Lipschitz with respect to the latent spaces; entries are ob-
served independently with some probability p; and the additive noise in observations is independently
distributed with zero mean and bounded support. In spite of the minimal assumptions of our model,
we provide a consistent matrix completion algorithm with finite sample error bounds. Furthermore,
as a coincidental by-product, we find that our framework provides an explanation of the practical
mystery of “why collaborative filtering algorithms work well in practice”.
There are two conceptual parts to our algorithm. First, we derive an estimate of f(x1(u), x2(i)) for
an unobserved index pair (u, i) by using a first order local Taylor approximation expanded around the
points corresponding to (u, i′), (u′, i), and (u′, i′). This leads to the estimate
ŷ(u, i) ≡ y(u′, i) + y(u, i′) − y(u′, i′) ≈ f(x1(u), x2(i)),   (2)
as long as x1(u′) is close to x1(u) or x2(i′) is close to x2(i). In kernel regression, distances between
input features are used to upper bound the error of individual estimates, but since the latent features
are not observed, we need another method to determine which of these estimates are reliable.
Secondly, under mild regularity conditions, we upper bound the squared error of the estimate in (2)
by the variance of the difference between commonly observed entries in rows (u, v) or columns (i, j).
We empirically estimate this quantity and use it, similarly to a distance in the latent space, to
appropriately weight individual estimates in the final prediction. If we choose only the datapoints
with minimum empirical row variance, we recover user-user nearest neighbor collaborative filtering.
Inspired by kernel regression, we also propose computing the weights according to a Gaussian kernel
applied to the minimum of the row or column sample variances.
As the main technical result, we show that the user-user nearest neighbor variant of the collaborative
filtering method with our similarity metric yields a consistent estimator for any Lipschitz function
as long as we observe a max(m^{−1+δ}, n^{−1/2+δ}) fraction of the matrix with δ > 0. In the process, we
obtain finite sample error bounds, whose details are stated in Theorem 1. We compared the Gaussian
kernel variant of our algorithm to classic collaborative filtering algorithms and a matrix factorization
based approach (softImpute) on predicting user-movie ratings for the Netflix and MovieLens datasets.
Experiments suggest that our method improves over existing collaborative filtering methods, and
sometimes outperforms matrix-factorization-based approaches depending on the dataset.

2 Setup
Operating assumptions. There are m users and n movies. The rating of user u ∈ [m] for movie
i ∈ [n] is given by (1), taking the form y(u, i) = f (x1 (u), x2 (i)) + ηu,i . We make the following
assumptions.

(a) X1 and X2 are compact metric spaces endowed with metrics dX1 and dX2 respectively:
dX1(x1, x1′) ≤ BX, ∀ x1, x1′ ∈ X1, and dX2(x2, x2′) ≤ BX, ∀ x2, x2′ ∈ X2.   (3)

(b) f : X1 × X2 → R is L-Lipschitz with respect to the ∞-product metric:
|f(x1, x2) − f(x1′, x2′)| ≤ L max{dX1(x1, x1′), dX2(x2, x2′)}, ∀ x1, x1′ ∈ X1, x2, x2′ ∈ X2.

(c) The latent features of each user u and movie i, x1 (u) and x2 (i), are sampled independently
according to Borel probability measures PX1 and PX2 on (X1 , TX1 ) and (X2 , TX2 ), where
TX denotes the Borel σ-algebra of a metric space X .
(d) The additive noise for all data points are independent and bounded with mean zero and
variance γ 2 : for all u ∈ [m], i ∈ [n],

ηu,i ∈ [−Bη, Bη],  E[ηu,i] = 0,  Var[ηu,i] = γ².   (4)

(e) The rating of each entry is revealed (observed) with probability p, independently.
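As a concrete illustration of assumptions (a)–(e), the following is a minimal sketch that generates a synthetic rating matrix under this model. The choice of latent space (the unit interval), the particular function f, and all parameter values below are hypothetical, used only to make the setup concrete.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, p = 200, 300, 0.3            # users, movies, observation probability (illustrative)
B_eta = 0.1                        # noise bound (illustrative)

# (c) latent features drawn i.i.d. from the unit interval, a compact metric space
x1 = rng.uniform(size=m)           # user features x1(u)
x2 = rng.uniform(size=n)           # movie features x2(i)

# (b) an arbitrary Lipschitz function f on X1 x X2 (illustrative choice)
def f(a, b):
    return np.sin(a + b) + a * b

# (d) bounded, zero-mean noise and the model (1): y(u, i) = f(x1(u), x2(i)) + eta_ui
F = f(x1[:, None], x2[None, :])    # m x n matrix of true values
eta = rng.uniform(-B_eta, B_eta, size=(m, n))
Y = F + eta

# (e) each entry is revealed independently with probability p
M = rng.random((m, n)) < p         # boolean observation mask
```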

Notation. Let random variable Mui = 1 if the rating of user u and movie i is revealed and 0
otherwise. Mui is an independent Bernoulli random variable with parameter p. Let N1 (u) denote the
set of column indices of observed entries in row u. Similarly, let N2 (i) denote the set of row indices
of observed entries in column i. That is,
N1(u) ≜ {i : M(u, i) = 1} and N2(i) ≜ {u : M(u, i) = 1}.   (5)
For rows v ≠ u, N1(u, v) ≜ N1(u) ∩ N1(v) denotes column indices of commonly observed entries
of rows (u, v). For columns i ≠ j, N2(i, j) ≜ N2(i) ∩ N2(j) denotes row indices of commonly
observed entries of columns (i, j). We refer to this as the overlap between two rows or columns.
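These index sets map directly onto simple mask operations; a hedged sketch (using the Y and M arrays from the synthetic example above, with illustrative names) is:

```python
import numpy as np

def N1(M, u):
    """N1(u): column indices of observed entries in row u."""
    return np.flatnonzero(M[u])

def N2(M, i):
    """N2(i): row indices of observed entries in column i."""
    return np.flatnonzero(M[:, i])

def N1_overlap(M, u, v):
    """N1(u, v): columns commonly observed in rows u and v."""
    return np.flatnonzero(M[u] & M[v])

def N2_overlap(M, i, j):
    """N2(i, j): rows commonly observed in columns i and j."""
    return np.flatnonzero(M[:, i] & M[:, j])
```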

3 Algorithm Intuition

Local Taylor Approximation. We propose a prediction algorithm for unknown ratings based on
insights from the classical Taylor approximation of a function. Suppose X1 ≅ X2 ≅ R, and we wish
to predict the unknown rating, f(x1(u), x2(i)), of user u ∈ [m] for movie i ∈ [n]. Using the first order
Taylor expansion of f around (x1(v), x2(j)) for some v ≠ u ∈ [m], j ≠ i ∈ [n], it follows that
\[
f(x_1(u), x_2(i)) \approx f(x_1(v), x_2(j)) + (x_1(u) - x_1(v))\,\frac{\partial f(x_1(v), x_2(j))}{\partial x_1} + (x_2(i) - x_2(j))\,\frac{\partial f(x_1(v), x_2(j))}{\partial x_2}.
\]
We are not able to directly compute this expression, as we do not know the latent features, the
function f , or the partial derivatives of f . However, we can again apply Taylor expansion for
f (x1 (v), x2 (i)) and f (x1 (u), x2 (j)) around (x1 (v), x2 (j)), which results in a set of equations with
the same unknown terms. It follows from rearranging terms and substitution that
f (x1 (u), x2 (i)) ≈ f (x1 (v), x2 (i)) + f (x1 (u), x2 (j)) − f (x1 (v), x2 (j)),
as long as the first order Taylor approximation is accurate. Thus if the noise term in (1) is small, we
can approximate f (x1 (u), x2 (i)) by using observed ratings y(v, j), y(u, j) and y(v, i) according to
ŷ(u, i) = y(u, j) + y(v, i) − y(v, j). (6)
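For completeness, the substitution step can be written out. Expanding f(x1(v), x2(i)) and f(x1(u), x2(j)) to first order around (x1(v), x2(j)) gives
\begin{align*}
f(x_1(v), x_2(i)) &\approx f(x_1(v), x_2(j)) + (x_2(i) - x_2(j))\,\frac{\partial f(x_1(v), x_2(j))}{\partial x_2},\\
f(x_1(u), x_2(j)) &\approx f(x_1(v), x_2(j)) + (x_1(u) - x_1(v))\,\frac{\partial f(x_1(v), x_2(j))}{\partial x_1}.
\end{align*}
Adding these two approximations and subtracting f(x_1(v), x_2(j)) reproduces the first order expansion of f(x_1(u), x_2(i)) above, which yields (6) without requiring knowledge of the latent features or the partial derivatives.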

Reliability of Local Estimates. We will show that the variance of the difference between two rows
or columns upper bounds the estimation error. Therefore, in order to ensure the accuracy of the above
estimate, we use empirical observations to estimate the variance of the difference between two rows
or columns, which directly relates to an error bound. By expanding (6) according to (1), the error
f (x1 (u), x2 (i)) − ŷ(u, i) is equal to
(f (x1 (u), x2 (i)) − f (x1 (v), x2 (i))) − (f (x1 (u), x2 (j)) − f (x1 (v), x2 (j))) − ηvi + ηvj − ηuj .
If we condition on x1(u) and x1(v),
\[
\mathbb{E}\big[(\mathrm{Error})^2 \mid x_1(u), x_1(v)\big] = 2\,\mathrm{Var}_{x \sim P_{\mathcal{X}_2}}\big[f(x_1(u), x) - f(x_1(v), x) \mid x_1(u), x_1(v)\big] + 3\gamma^2.
\]

Similarly, if we condition on x2 (i) and x2 (j) it follows that the expected squared error is bounded by
the variance of the difference between the ratings of columns i and j. This theoretically motivates
weighting the estimates according to the variance of the difference between the rows or columns.
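To spell out the step above, write D(x) := f(x1(u), x) − f(x1(v), x) (notation introduced here only for brevity), so that the error equals (D(x_2(i)) − D(x_2(j))) − η_{vi} + η_{vj} − η_{uj}. Since the noise terms are independent with mean zero and variance γ², and x2(i), x2(j) are i.i.d. given the conditioning,
\begin{align*}
\mathbb{E}\big[(\mathrm{Error})^2 \mid x_1(u), x_1(v)\big]
&= \mathbb{E}\big[(D(x_2(i)) - D(x_2(j)))^2\big] + \mathbb{E}[\eta_{vi}^2] + \mathbb{E}[\eta_{vj}^2] + \mathbb{E}[\eta_{uj}^2]\\
&= 2\,\mathrm{Var}_{x \sim P_{\mathcal{X}_2}}[D(x)] + 3\gamma^2,
\end{align*}
where the cross terms vanish by independence and zero mean, and the first term equals twice the variance because x2(i) and x2(j) are independent with identical distribution.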

4 Algorithm Description
We provide the algorithm for predicting an unknown entry in position (u, i) using available data.
Given a parameter β ≥ 2, define β-overlapping neighbors of u and i respectively as
S_u^β(i) = {v : v ∈ N2(i), v ≠ u, |N1(u, v)| ≥ β},
S_i^β(u) = {j : j ∈ N1(u), j ≠ i, |N2(i, j)| ≥ β}.
For each v ∈ S_u^β(i), compute the empirical row variance between u and v,
\[
s_{uv}^2 = \frac{1}{2\,|N_1(u,v)|\,\big(|N_1(u,v)| - 1\big)} \sum_{i, j \in N_1(u,v)} \Big( \big(y(u,i) - y(v,i)\big) - \big(y(u,j) - y(v,j)\big) \Big)^2. \tag{7}
\]

Similarly, compute the empirical column variance between i and j, for all j ∈ S_i^β(u),
\[
s_{ij}^2 = \frac{1}{2\,|N_2(i,j)|\,\big(|N_2(i,j)| - 1\big)} \sum_{u, v \in N_2(i,j)} \Big( \big(y(u,i) - y(u,j)\big) - \big(y(v,i) - y(v,j)\big) \Big)^2. \tag{8}
\]
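A minimal sketch of the empirical variances in (7) and (8). Equation (7) is algebraically the unbiased sample variance of the differences y(u, ·) − y(v, ·) restricted to the overlap N1(u, v), and similarly for (8); that is what the code below computes. Y, M and all names follow the earlier illustrative sketches.

```python
import numpy as np

def row_variance(Y, M, u, v):
    """s^2_{uv} in (7): sample variance of y(u,.) - y(v,.) over the overlap N1(u, v)."""
    cols = np.flatnonzero(M[u] & M[v])
    if len(cols) < 2:
        return np.inf                   # undefined without at least two common entries
    d = Y[u, cols] - Y[v, cols]
    return d.var(ddof=1)                # equals the pairwise-difference form in (7)

def col_variance(Y, M, i, j):
    """s^2_{ij} in (8): sample variance of y(., i) - y(., j) over the overlap N2(i, j)."""
    rows = np.flatnonzero(M[:, i] & M[:, j])
    if len(rows) < 2:
        return np.inf
    d = Y[rows, i] - Y[rows, j]
    return d.var(ddof=1)
```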

Let B^β(u, i) denote the set of positions (v, j) such that the entries y(v, j), y(u, j), and y(v, i) are
observed, and the number of commonly observed ratings between rows (u, v) and between columns (i, j) is at least β:
\[
B^\beta(u, i) = \left\{ (v, j) \in S_u^\beta(i) \times S_i^\beta(u) \ \text{s.t.}\ M(v, j) = 1 \right\}.
\]

Compute the final estimate as a convex combination of the estimates derived in (6) for (v, j) ∈ B^β(u, i),
\[
\hat{y}(u, i) = \frac{\sum_{(v,j) \in B^\beta(u,i)} w_{ui}(v, j)\,\big( y(u,j) + y(v,i) - y(v,j) \big)}{\sum_{(v,j) \in B^\beta(u,i)} w_{ui}(v, j)}, \tag{9}
\]

where the weights wui (v, j) are defined as a function of (7) and (8). We proceed to discuss a few
choices for the weight function, each of which results in a different algorithm.
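A sketch of the overall prediction (9), taking the weight rule as an argument so that the choices discussed below can be plugged in; the β-filtering and variable names mirror the definitions above and the earlier illustrative sketches.

```python
import numpy as np

def predict(Y, M, u, i, beta, weight_fn):
    """Convex combination of basic estimates over B^beta(u, i), as in (9)."""
    # beta-overlapping neighbors S_u^beta(i) and S_i^beta(u)
    S_u = [v for v in np.flatnonzero(M[:, i])
           if v != u and np.count_nonzero(M[u] & M[v]) >= beta]
    S_i = [j for j in np.flatnonzero(M[u])
           if j != i and np.count_nonzero(M[:, i] & M[:, j]) >= beta]
    num = den = 0.0
    for v in S_u:
        for j in S_i:
            if not M[v, j]:                     # the entry (v, j) must itself be observed
                continue
            w = weight_fn(Y, M, u, i, v, j)     # weight w_{ui}(v, j)
            num += w * (Y[u, j] + Y[v, i] - Y[v, j])
            den += w
    return num / den if den > 0 else np.nan
```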

User-User or Item-Item Nearest Neighbor Weights. We can evenly distribute the weights only
among entries in the nearest neighbor row, i.e., the row with minimal empirical variance,
\[
w_{vj} = \mathbb{1}(v = u^*), \quad \text{for } u^* \in \operatorname*{arg\,min}_{v \in S_u^\beta(i)} s_{uv}^2.
\]

If we substitute these weights in (9), we recover an estimate which is asymptotically equivalent to the
mean-adjusted variant of the classical user-user nearest neighbor (collaborative filtering) algorithm,
ŷ(u, i) = y(u∗ , i) + muu∗ ,
where muu∗ is the empirical mean of the difference of ratings between rows u and u∗ . For any u, v,
\[
m_{uv} = \frac{1}{|N_1(u,v)|} \sum_{j \in N_1(u,v)} \big( y(u, j) - y(v, j) \big).
\]

Equivalently, we can evenly distribute the weights among entries in the nearest neighbor column, i.e.,
the column with minimal empirical variance, recovering the classical mean-adjusted item-item nearest
neighbor collaborative filtering algorithm. Theorem 1 proves that this simple algorithm produces
a consistent estimator, and we provide the finite sample error analysis. Due to the similarities, our
analysis also directly implies the proof of correctness and consistency for the classic user-user and
item-item collaborative filtering method.
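As a sketch, this nearest-neighbor variant can also be written directly, without the general weight machinery: pick the row u* with minimal empirical variance among the β-overlapping neighbors and adjust by the empirical mean difference m_{uu*}. The helper row_variance is the illustrative sketch from above; all names are hypothetical.

```python
import numpy as np

def predict_user_user_nn(Y, M, u, i, beta):
    """Mean-adjusted user-user nearest-neighbor estimate: y(u*, i) + m_{uu*}."""
    candidates = [v for v in np.flatnonzero(M[:, i])
                  if v != u and np.count_nonzero(M[u] & M[v]) >= beta]
    if not candidates:
        return np.nan
    u_star = min(candidates, key=lambda v: row_variance(Y, M, u, v))
    cols = np.flatnonzero(M[u] & M[u_star])        # N1(u, u*)
    m_uu_star = np.mean(Y[u, cols] - Y[u_star, cols])
    return Y[u_star, i] + m_uu_star
```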

User-Item Gaussian Kernel Weights. Inspired by kernel regression, we introduce a variant of
the algorithm which computes the weights according to a Gaussian kernel function with bandwidth
parameter λ, substituting the minimum of the row and column sample variances as a proxy for the distance,
\[
w_{vj} = \exp\!\big( -\lambda \min\{ s_{uv}^2, s_{ij}^2 \} \big).
\]
When λ = ∞, the estimate only depends on the basic estimates whose row or column has the
minimum sample variance. When λ = 0, the algorithm equally averages all basic estimates. We
applied this variant of our algorithm to both movie recommendation and image inpainting data; the
results show that our algorithm improves upon classical user-user and item-item collaborative filtering.
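A sketch of this weight rule, written so it can be passed as weight_fn to the prediction sketch above; λ is the bandwidth, and row_variance/col_variance are the earlier illustrative helpers.

```python
import numpy as np

def gaussian_kernel_weight(lam):
    """Returns a weight function computing w_{vj} = exp(-lam * min{s^2_{uv}, s^2_{ij}})."""
    def weight_fn(Y, M, u, i, v, j):
        s2 = min(row_variance(Y, M, u, v), col_variance(Y, M, i, j))
        return np.exp(-lam * s2)
    return weight_fn

# example usage with hypothetical parameter values:
# y_hat = predict(Y, M, u=0, i=5, beta=2, weight_fn=gaussian_kernel_weight(lam=1.7))
```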

Connections to Cosine Similarity Weights. In our algorithm, we determine reliability of estimates
as a function of the sample variance, which is equivalent to the squared distance of the mean-adjusted
values. In classical collaborative filtering, cosine similarity is commonly used, which can be
approximated as a different choice of the weight kernel over the squared difference.

5 Main Theorem
Let E ⊂ [m] × [n] denote the set of user-movie pairs for which the algorithm predicts a rating. For
ε > 0, the overall ε-risk of the algorithm is the fraction of estimates whose error is larger than ε,
\[
\mathrm{Risk}_\varepsilon = \frac{1}{|E|} \sum_{(u,i) \in E} \mathbb{1}\big( |f(x_1(u), x_2(i)) - \hat{y}(u,i)| > \varepsilon \big). \tag{10}
\]
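A minimal sketch of the empirical ε-risk in (10), given the true values f(x1(u), x2(i)) (available only in simulation, e.g. from the synthetic example earlier) and the predictions on an evaluation set E of index pairs:

```python
import numpy as np

def eps_risk(F_true, Y_hat, E, eps):
    """Fraction of estimates on E whose absolute error exceeds eps, as in (10)."""
    errors = np.array([abs(F_true[u, i] - Y_hat[u, i]) for (u, i) in E])
    return np.mean(errors > eps)
```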

In Theorem 1, we upper bound the expected ε-Risk, proving that the user-user nearest neighbor
estimator is consistent, i.e., in the presence of no noise, estimates converge to the true values as m, n
go to infinity. We may assume m ≤ n without loss of generality.
Theorem 1. For a fixed ε > 0, as long as p ≥ max{m^{−1+δ}, n^{−1/2+δ}} (where δ > 0), for any
ρ = ω(n^{−2δ/3}), the user-user nearest-neighbor variant of our method with β = np²/2 achieves
\[
\mathbb{E}[\mathrm{Risk}_\varepsilon] \le \frac{3\rho + \gamma^2}{\varepsilon^2}\left( 1 + \frac{3 \cdot 2^{1/3} - 2}{\varepsilon}\, n^{-\delta/3} \right) + O\!\left( \exp\!\left( -\tfrac{C}{4}\, m^{\delta} \right) + m^{\delta} \exp\!\left( -\tfrac{1}{5B^2}\, n^{2\delta/3} \right) \right),
\]
where B = 2(LB_X + B_η), and C = h(√(ρ/L²)) ∧ 1/6 for h(r) := inf_{x0 ∈ X1} P_{x∼PX1}(dX1(x, x0) ≤ r).

For a generic β, we can also provide precise error bounds of a similar form, with modified rates of
convergence. Choosing β to grow with np² ensures that as n goes to infinity, the required overlap
between rows also goes to infinity, thus the empirical mean and variance computed in the algorithm
converge precisely to the true mean and variance. The parameter ρ in Theorem 1 is introduced purely
for the purpose of analysis, and is not used within the implementation of the algorithm.
The function h behaves as a lower bound of the cumulative distribution function of PX1 , and it always
exists under our assumptions that X1 is compact. It is used to ensure that for any u ∈ [m], with high
probability, there exists another row v ∈ Suβ (i) such that dX1 (x1 (u), x1 (v)) is small, implying by
the Lipschitz condition that we can use the values of row v to approximate the values of row u well.
For example, if PX1 is a uniform distribution over a unit cube in q-dimensional Euclidean space,
then h(r) = min(1, r)^q, and our error bound becomes meaningful for n ≥ (L²/ρ)^{q/2δ}. On the other
hand, if PX1 is supported over finitely many points, then h(r) = min_{x ∈ supp(PX1)} PX1(x) is a positive
constant, and the role of the latent dimension becomes irrelevant. Intuitively, the “geometry” of PX1
through h near 0 determines the impact of the latent space dimension on the sample complexity, and
our results hold as long as the latent dimension q = o(log n).

6 Proof Sketch
For any evaluation set of unobserved entries E, the expectation of ε-risk is
\[
\mathbb{E}[\mathrm{Risk}_\varepsilon] = \frac{1}{|E|} \sum_{(u,i) \in E} \mathbb{P}\big( |f(x_1(u), x_2(i)) - \hat{y}(u,i)| > \varepsilon \big) = \mathbb{P}\big( |f(x_1(u), x_2(i)) - \hat{y}(u,i)| > \varepsilon \big),
\]

because the entries are exchangeable and identically distributed. To bound the
expected risk, it is sufficient to provide a tail bound for the probability of the error. For any fixed
a, b ∈ X1 , and random variable x ∼ PX2 , we denote the mean and variance of the difference
f (a, x) − f (b, x) by
\begin{align*}
\mu_{ab} &\triangleq \mathbb{E}_x[f(a, x) - f(b, x)] = \mathbb{E}[m_{uv} \mid x_1(u) = a, x_1(v) = b],\\
\sigma_{ab}^2 &\triangleq \mathrm{Var}_x[f(a, x) - f(b, x)] = \mathbb{E}[s_{uv}^2 \mid x_1(u) = a, x_1(v) = b] - 2\gamma^2,
\end{align*}
which we point out is also equivalent to the expectation of the empirical means and variances
computed by the algorithm when we condition on the latent representations of the users. The
computation of ŷ(u, i) involves two steps: first the algorithm determines the neighboring row with the
minimum sample variance, u* = arg min_{v ∈ S_u^β(i)} s²_uv, and then it computes the estimate by adjusting
according to the empirical mean, ŷ(u, i) := y(u*, i) + m_{uu*}.
The proof involves three key steps, each stated within a lemma. Lemma 1 proves that with high
probability the observations are dense enough that there is a sufficient number of rows with
overlap larger than β, i.e., the number of candidate rows, |S_u^β(i)|, concentrates around
(m − 1)p. This relies on concentration of Binomial random variables via Chernoff's bound.
Lemma 1. Given p > 0, 2 ≤ β ≤ np²/2 and α > 0, for any (u, i) ∈ [m] × [n],
\[
\mathbb{P}\left( |S_u^\beta(i)| \notin (1 \pm \alpha)(m-1)p \right) \le 2\exp\!\left( -\frac{\alpha^2 (m-1)p}{3} \right) + (m-1)\exp\!\left( -\frac{np^2}{8} \right).
\]

Lemma 2 proves that since the latent features are sampled i.i.d. from a bounded metric space, for any
index pair (u, i), there exists a "good" neighboring row v ∈ S_u^β(i) whose σ²_{x1(u)x1(v)} is small.

Lemma 2. Consider u ∈ [m] and a set S ⊂ [m] \ {u}. Then for any ρ > 0,
\[
\mathbb{P}\left( \min_{v \in S} \sigma^2_{x_1(u)x_1(v)} > \rho \right) \le \left( 1 - h\!\left( \sqrt{\rho/L^2} \right) \right)^{|S|},
\]
where h(r) := inf_{x0 ∈ X1} P_{x∼PX1}(dX1(x, x0) ≤ r).

Subsequently, conditioned on the event that |Suβ (i)| ≈ (m − 1)p, Lemmas 3 and 4 prove that the
sample mean and sample variance of the differences between two rows concentrate around the
true mean and true variance with high probability. This involves using the Lipschitz and boundedness
assumptions on f and X1, as well as the Bernstein and Maurer-Pontil [19] inequalities.
Lemma 3. Given u, v ∈ [m], i ∈ [n] and β ≥ 2, for any α > 0,
\[
\mathbb{P}\left( \big| \mu_{x_1(u)x_1(v)} - m_{uv} \big| > \alpha \;\middle|\; v \in S_u^\beta(i) \right) \le \exp\!\left( -\frac{3\beta\alpha^2}{6B^2 + 2B\alpha} \right),
\]
where recall that B = 2(LB_X + B_η).
Lemma 4. Given u ∈ [m], i ∈ [n], and β ≥ 2, for any ρ > 0,
\[
\mathbb{P}\left( \big| s_{uv}^2 - \big( \sigma^2_{x_1(u)x_1(v)} + 2\gamma^2 \big) \big| > \rho \;\middle|\; v \in S_u^\beta(i) \right) \le 2\exp\!\left( -\frac{\beta\rho^2}{4B^2\big( 2LB_X^2 + 4\gamma^2 + \rho \big)} \right),
\]
where recall that B = 2(LB_X + B_η).

Given that there exists a neighbor v ∈ S_u^β(i) whose true variance σ²_{x1(u)x1(v)} is small, and conditioned
on the event that all the sample variances concentrate around the true variances, it follows that the true
variance between u and its nearest neighbor u* is small with high probability. Finally, conditioned
on the event that |S_u^β(i)| ≈ (m − 1)p and the true variance between the target row and the nearest
neighbor row is small, we bound the tail probability of the estimation error using Chebyshev's
inequality. The only term in the error probability which does not decay to zero is the error from
Chebyshev's inequality, which dominates the final expression, leading to the final result.

7 Experiments
We evaluated the performance of our algorithm in predicting user-movie ratings on the MovieLens 1M
and Netflix datasets. For the implementation of our method, we used user-item Gaussian kernel
weights for the final estimator. We chose overlap parameter β = 2 to ensure the algorithm is able
to compute an estimate for all missing entries. When β is larger, the algorithm enforces rows (or
columns) to have more commonly rated movies (or users). Although this increases the reliability of
the estimates, it also reduces the fraction of entries for which the estimate is defined. We optimized
the λ bandwidth parameter of the Gaussian kernel by evaluating the method with multiple values for
λ and choosing the value which minimizes the error.
We compared our method with user-user collaborative filtering, item-item collaborative filtering,
and softImpute from [20]. We chose the classic mean-adjusted collaborative filtering method, in
which the weights are proportional to the cosine similarity of pairs of users or items (i.e. movies).
SoftImpute is a matrix-factorization-based method which iteratively replaces missing elements in the
matrix with those obtained from a soft-thresholded SVD.
For both MovieLens and Netflix data sets, the ratings are integers from 1 to 5. From each dataset, we
generated 100 smaller user-movie rating matrices, in which we randomly subsampled 2000 users and
2000 movies. For each rating matrix, we randomly select and withhold a percentage of the known
ratings for the test set, while the remaining portion of the data set is revealed to the algorithm for
computing the estimates. After the algorithm computes its predictions for unrevealed movie-user
pairs, we evaluate the Root Mean Squared Error (RMSE) of the predictions compared with the
withheld test set, where RMSE is defined as the square root of the mean of squared prediction error
over the evaluation set. Figure 1 plots the RMSE of our method along with classic collaborative
filtering and softImpute evaluated against 10%, 30%, 50%, and 70% withheld test sets. The RMSE is
averaged over 100 subsampled rating matrices, and 95% confidence intervals are provided.
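A sketch of this evaluation loop for a single subsampled matrix: withhold a fraction of the observed ratings as a test set, predict them from the remaining observations, and report the RMSE on the withheld entries. The function below assumes a prediction routine with the signature of the earlier sketches; all names and default values are illustrative.

```python
import numpy as np

def rmse_on_withheld(Y, M, predict_fn, test_frac=0.1, seed=0):
    """Withhold test_frac of the observed entries, predict them, and return the RMSE."""
    rng = np.random.default_rng(seed)
    obs = np.argwhere(M)                                    # all observed (u, i) pairs
    test = obs[rng.choice(len(obs), size=int(test_frac * len(obs)), replace=False)]
    M_train = M.copy()
    M_train[test[:, 0], test[:, 1]] = False                 # hide the test ratings
    preds = np.array([predict_fn(Y, M_train, u, i) for u, i in test])
    truth = Y[test[:, 0], test[:, 1]]
    ok = ~np.isnan(preds)                                    # entries the method could predict
    return np.sqrt(np.mean((preds[ok] - truth[ok]) ** 2))
```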

Figure 1: Performance of algorithms on Netflix and MovieLens datasets with 95% confidence interval.
λ values used by our algorithm are 2.8 (10%), 2.3 (30%), 1.7 (50%), 1 (70%) for MovieLens, and 1.8
(10%), 1.7 (30%), 1.6 (50%), 1.5 (70%) for Netflix.

Figure 1 suggests that our algorithm achieves a systematic improvement over classical user-user
and item-item collaborative filtering. SoftImpute performs the worst on the MovieLens dataset,
but it performs the best on the Netflix dataset. This behavior could be due to the different underlying
assumptions of low rank for matrix factorization methods as opposed to Lipschitz continuity for
collaborative filtering methods, which could lead to dataset-dependent performance.

8 Discussion
We introduced a generic framework of blind regression, i.e., nonparametric regression over latent
variable models. We allow the model to be any Lipschitz function f over any bounded feature space
X1, X2, while imposing the limitation that the input features are latent. This applies to a wide
variety of problems beyond recommendation systems, including social network analysis,
community detection, crowdsourcing, and product demand prediction. Many parametric models (e.g.,
low rank assumptions) can be framed as a specific case of our model.
Despite the generality and limited assumptions of our model, we present a simple similarity-based
estimator, and we provide theoretical guarantees bounding its error within the noise level γ². The
analysis provides theoretical grounds for the popularity of similarity-based methods. To the best of
our knowledge, this is the first provable guarantee on the performance of neighbor-based collaborative
filtering within a fully nonparametric model. Our algorithm and analysis follow from local Taylor
approximation, along with the observation that the sample variance between rows or columns is a good
indicator of “closeness”, or the similarity of their function values. The algorithm essentially estimates
the local metric information between the latent features from observed data, and then performs local
smoothing in a similar manner as classical kernel regression.
Due to the local nature of our algorithm, our sample complexity does not depend on the latent
dimension, whereas Chatterjee's USVT estimator [6] requires sampling almost every entry when
the latent dimension is large. This difference is due to the fact that Chatterjee's result stems from
showing that a Lipschitz function can be approximated by a piecewise constant function, which upper
bounds the rank of the target matrix. This discretization results in a large penalty with respect to the
dimension of the latent space. Since our method follows from local approximations, we only require
sufficient sampling such that locally there are enough close neighbor points.
The connection of our framework to regression implies many natural future directions. We can
extend model (1) to multivariate functions f , which translates to the problem of higher order tensor
completion. Variations of the algorithm and analysis that we provide for matrix completion can
extend to tensor completion, due to the flexible and generic assumptions of our model. It would also
be useful to extend the results to capture general noise models, sparser sampling regimes, or mixed
models with both parametric and nonparametric or both latent and observed variables.

Acknowledgements: This work is supported in part by ARO under MURI award 133668-5079809,
by NSF under grants CMMI-1462158 and CMMI-1634259, and additionally by a Samsung Scholar-
ship, Siebel Scholarship, NSF Graduate Fellowship, and Claude E. Shannon Research Assistantship.

References
[1] S. Aditya, O. Dabeer, and B. K. Dey. A channel coding perspective of collaborative filtering. IEEE
Transactions on Information Theory, 57(4):2327–2341, 2011.
[2] G. Bresler, G. H. Chen, and D. Shah. A latent source model for online collaborative filtering. In Advances
in Neural Information Processing Systems, pages 3347–3355, 2014.
[3] G. Bresler, D. Shah, and L. F. Voloch. Collaborative filtering with low regret. arXiv preprint
arXiv:1507.05371, 2015.
[4] D. Cai, X. He, X. Wu, and J. Han. Non-negative matrix factorization on manifold. In Data Mining, 2008.
ICDM’08. Eighth IEEE International Conference on, pages 63–72. IEEE, 2008.
[5] E. J. Candès and B. Recht. Exact matrix completion via convex optimization. Foundations of Computational
Mathematics, 9(6):717–772, 2009.
[6] S. Chatterjee et al. Matrix estimation by universal singular value thresholding. The Annals of Statistics,
43(1):177–214, 2015.
[7] M. Fazel, H. Hindi, and S. P. Boyd. Log-det heuristic for matrix rank minimization with applications to
Hankel and Euclidean distance matrices. In Proceedings of ACC, volume 3, pages 2156–2162. IEEE, 2003.
[8] R. S. Ganti, L. Balzano, and R. Willett. Matrix completion under monotonic single index models. In
Advances in Neural Information Processing Systems, pages 1864–1872, 2015.
[9] D. Goldberg, D. Nichols, B. M. Oki, and D. Terry. Using collaborative filtering to weave an information
tapestry. Commun. ACM, 1992.
[10] P. Jain, P. Netrapalli, and S. Sanghavi. Low-rank matrix completion using alternating minimization. In
Proceedings of the 45th annual ACM symposium on Theory of computing, pages 665–674. ACM, 2013.
[11] R. Keshavan, A. Montanari, and S. Oh. Matrix completion from a few entries. IEEE Trans. Inf. Theory,
56(6), 2009.
[12] J. Kleinberg and M. Sandler. Convergent algorithms for collaborative filtering. In Proceedings of the 4th
ACM conference on Electronic commerce, pages 1–10. ACM, 2003.
[13] J. Kleinberg and M. Sandler. Using mixture models for collaborative filtering. In Proceedings of the
thirty-sixth annual ACM symposium on Theory of computing, pages 569–578. ACM, 2004.
[14] Y. Koren and R. Bell. Advances in collaborative filtering. In Recommender Systems Handbook, pages
145–186. Springer US, 2011.
[15] Z. Lin, A. Ganesh, J. Wright, L. Wu, M. Chen, and Y. Ma. Fast convex optimization algorithms for exact
recovery of a corrupted low-rank matrix. CAMSAP, 61, 2009.
[16] G. Linden, B. Smith, and J. York. Amazon.com recommendations: Item-to-item collaborative filtering.
IEEE Internet Computing, 7(1):76–80, 2003.
[17] Z. Liu and L. Vandenberghe. Interior-point method for nuclear norm approximation with application to
system identification. SIAM Journal on Matrix Analysis and Applications, 31(3):1235–1256, 2010.
[18] Y. Mack and B. W. Silverman. Weak and strong uniform consistency of kernel regression estimates.
Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 61(3):405–415, 1982.
[19] A. Maurer and M. Pontil. Empirical Bernstein Bounds and Sample Variance Penalization. ArXiv e-prints,
July 2009.
[20] R. Mazumder, T. Hastie, and R. Tibshirani. Spectral regularization algorithms for learning large incomplete
matrices. The Journal of Machine Learning Research, 11:2287–2322, 2010.
[21] S. Negahban and M. J. Wainwright. Restricted strong convexity and weighted matrix completion: Optimal
bounds with noise. The Journal of Machine Learning Research, 13(1):1665–1697, 2012.
[22] X. Ning, C. Desrosiers, and G. Karypis. Recommender Systems Handbook, chapter A Comprehensive
Survey of Neighborhood-Based Recommendation Methods, pages 37–76. Springer US, 2015.
[23] A. Rohde, A. B. Tsybakov, et al. Estimation of high-dimensional low-rank matrices. The Annals of
Statistics, 39(2):887–930, 2011.
[24] B.-H. Shen, S. Ji, and J. Ye. Mining discrete patterns via binary matrix factorization. In Proceedings of the
15th ACM SIGKDD international conference, pages 757–766. ACM, 2009.
[25] N. Srebro, N. Alon, and T. S. Jaakkola. Generalization error bounds for collaborative prediction with
low-rank matrices. In Advances In Neural Information Processing Systems, pages 1321–1328, 2004.
[26] M. P. Wand and M. C. Jones. Kernel smoothing. Crc Press, 1994.
