Introduction To Recommender System
Introduction To Recommender System
Most large-scale commercial and social websites recommend options, such as products or
people to connect with, to users. Recommendation engines sort through massive amounts of
data to identify potential user preferences. This article, the first in a two-part series, explains the
ideas behind recommendation systems and introduces you to the algorithms that power them.
In Part 2, learn about some open source recommendation engines you can put to work.
LinkedIn, the business-oriented social networking site, forms recommendations for people
you might know, jobs you might like, groups you might want to follow, or companies you
might be interested in. LinkedIn uses Apache Hadoop to build its specialized collaborative-
filtering capabilities.
Amazon, the popular e-commerce site, uses content-based recommendation. When you
select an item to purchase, Amazon recommends other items other users purchased based
on that original item (as a matrix of item-to-likelihood-of-next-item purchase). Amazon
patented this behavior, called item-to-item collaborative filtering.
Netflix, the video rental and streaming service, is a famous example. In 2006, Netflix held
a competition to improve its recommendation system, Cinematch. In 2009, three teams
combined to build an ensemble of 107 recommendation algorithms that resulted in a single
prediction. This ensemble proved to be the key to improving predictive accuracy, and the
combined team won the prize.
Other sites that incorporate recommendation engines include Facebook, Twitter, Google,
MySpace, Last.fm, Del.icio.us, Pandora, Goodreads, and your favorite online news site. Use
of a recommendation engine is becoming a standard element of a modern web presence.
Recommendation systems changed the way inanimate websites communicate with their users.
Rather than providing a static experience in which users search for and potentially buy products,
recommender systems increase interaction to provide a richer experience. Recommender systems
identify recommendations autonomously for individual users based on past purchases and
searches, and on other users' behavior. This article introduces you to recommender systems
and the algorithms that they implement. In Part 2, learn about open source options for building a
recommendation capability.
Basic approaches
Most recommender systems take either of two basic approaches: collaborative filtering or content-
based filtering. Other approaches (such as hybrid approaches) also exist.
Collaborative filtering
Collaborative filtering arrives at a recommendation that's based on a model of prior user behavior.
The model can be constructed solely from a single user's behavior or — more effectively — also
from the behavior of other users who have similar traits. When it takes other users' behavior into
account, collaborative filtering uses group knowledge to form a recommendation based on like
users. In essence, recommendations are based on an automatic collaboration of multiple users
and filtered on those who exhibit similar preferences or behaviors.
For example, suppose you're building a website to recommend blogs. By using the information
from many users who subscribe to and read blogs, you can group those users based on their
preferences. For example, you can group together users who read several of the same blogs.
From this information, you identify the most popular blogs that are read by that group. Then —
for a particular user in the group — you recommend the most popular blog that he or she neither
reads nor subscribes to.
In the table in Figure 1, a set of blogs forms the rows, and the columns define the users. The
intersection of blog and user contains the number of articles read by that user of that blog. By
clustering the users based on their reading habits (for example, by using a nearest-neighbor
algorithm), you can see two clusters of two users each. Note the similarities in the reading habits
of the members of each cluster: Marc and Elise, who both read several articles about Linux® and
cloud computing, form Cluster 1. In Cluster 2 reside Megan and Jill, who both read several articles
about Java™ and agile.
Now you can identify some differences within each cluster and make meaningful
recommendations. In Cluster 1, Marc read 10 open source blog articles, and Elise read none;
Elise read one agile blog, and Marc read none. In Figure 1, then, one recommendation for Elise is
the open source blog. No recommendations can be made for Marc because the small difference
between him and Elise in agile blog reads would likely be filtered away. In Cluster 2, Jill read three
open source blogs, and Elise read none; Elise read 11 Linux blogs, and Jill read none. Cluster
2, then, carries a pair of recommendations: the Linux blog for Jill and the open source blog for
Megan.
Another way to view these relationships is based on their similarities and differences, as illustrated
in the Venn diagram in Figure 2. The similarities define (based on the particular algorithm used)
how to group users who have similar interests. The differences are opportunities that can be used
for recommendation — applied through a filter of popularity, for example.
Although Figure 2 is a simplification (and suffers from a sparseness of data by using only two
samples), it's a convenient representation.
Content-based filtering
Content-based filtering constructs a recommendation on the basis of a user's behavior. For
example, this approach might use historical browsing information, such as which blogs the user
reads and the characteristics of those blogs. If a user commonly reads articles about Linux or is
likely to leave comments on blogs about software engineering, content-based filtering can use this
history to identify and recommend similar content (articles on Linux or other blogs about software
engineering). This content can be manually defined or automatically extracted based on other
similarity methods.
Returning to Figure 1, focus on the user Elise. If you use a blog ranking that specifies that users
who read about Linux might also enjoy reading about open source and cloud computing, you can
easily recommend — on the basis of her current reading habits — that Elise read about open
source. This approach, illustrated in Figure 3, relies solely on content that a single user accesses,
not on the behavior of other users in the system.
The Venn diagram in Figure 2 also applies here: If one side is the user Elise and the other a
ranked set of similar blogs, the similarities are ignored (because those blogs were already read by
Elise), and the ranked differences are the opportunities for recommendation.
Hybrids
Hybrid approaches that combine collaborative and content-based filtering are also increasing the
efficiency (and complexity) of recommender systems. A simple example of a hybrid system could
use the approaches shown in Figure 1 and Figure 3. Incorporating the results of collaborative
and content-based filtering creates the potential for a more accurate recommendation. The hybrid
approach could also be used to address collaborative filtering that starts with sparse data —
known as cold start— by enabling the results to be weighted initially toward content-based filtering,
then shifting the weight toward collaborative filtering as the available user data set matures.
Pearson correlation
Similarity between two users (and their attributes, such as articles read from a collection of blogs)
can be accurately calculated with the Pearson correlation. This algorithm measures the linear
dependence between two variables (or users) as a function of their attributes. But it doesn't
calculate this measure over the entire population of users. Instead, the population must be filtered
down to neighborhoods based on a higher-level similarity metric, such as reading similar blogs.
The Pearson correlation, which is widely used in research, is a popular algorithm for collaborative
filtering.
Clustering algorithms
Clustering algorithms are a form of unsupervised learning that can find structure in a set of
seemingly random (or unlabeled) data. In general, they work by identifying similarities among
items, such as blog readers, by calculating their distance from other items in a feature space.
(Features in a feature space could represent the number of articles read in a set of blogs.) The
number of independent features defines the dimensionality of the space. If items are "close"
together, they can be joined in a cluster.
Many clustering algorithms exist. The simplest one is k-means, which partitions items into k
clusters. Initially, the items are randomly placed into clusters. Then, a centroid (or center) is
calculated for each cluster as a function of its members. Each item's distance from the centroids
is then checked. If an item is found to be closer to another cluster, it's moved to that cluster.
Centroids are recalculated each time all item distances are checked. When stability is reached
(that is, when no items move during an iteration), the set is properly clustered, and the algorithm
ends.
Calculating the distance between two objects can be difficult to visualize. One common method is
to treat each item as a multidimensional vector and calculate the distance by using the Euclidean
algorithm.
Other clustering variants include the Adaptive Resonance Theory (ART) family, Fuzzy C-means,
and Expectation-Maximization (probabilistic clustering), to name a few.
Other algorithms
Many algorithms — and an even larger set of variations of those algorithms — exist for
recommendation engines. Some that have been used successfully include:
• Bayesian Belief Nets, which can be visualized as a directed acyclic graph, with arcs
representing the associated probabilities among the variables.
• Markov chains, which take a similar approach to Bayesian Belief Nets but treat the
recommendation problem as sequential optimization instead of simply prediction.
• Rocchio classification (developed with the Vector Space Model), which exploits feedback of
the item relevance to improve recommendation accuracy.
other users do not exhibit typical behavior. These users can skew the results of a recommender
system and decrease its efficiency. Further, users can exploit a recommender system to favor
one product over another — based on positive feedback on a product and negative feedback on
competitive products, for example. A good recommender system must manage these issues.
Going further
Recommendation engines now power most of the popular social and commerce websites. They
provide tremendous value to the site's owners and to its users but also have some downsides.
This article explored some of the ideas behind recommendation systems and the algorithms that
power them. Part 2 of this series introduces you to some of the open source options available for
building a recommendation capability.
Resources
Learn
• Recommender Systems Wiki: Visit this site to access information about recommender
systems from a wide range of sources.
• Recommendations @ LinkedIn: Learn about LinkedIn's recommendation engine from this
presentation by Abihishek Gupta and Adil Aijaz.
• Hulu's Recommendation System: Find out how Hulu's recommendation engine helps content
owners promote their content on Hulu's framework.
• Netflix Recommendations: Beyond the 5 stars: Learn about the Netflix recommendation
engine, the Netflix prize, and the issues that arose in the competition for the prize.
• Web-Scale User Modeling for Targeting: Read this paper from Yahoo! to learn about a user-
modeling platform for optimizing ad selection for users on an Internet scale with Hadoop.
• A Survey of Collaborative Filtering Techniques: Read an in-depth treatment of recommender
systems and their challenges, and a useful survey of collaborative-filtering techniques.
• AI Application Programming, 2nd edition (M. Tim Jones, Cengage Learning, 2005): Learn
more about recommender systems and artificial intelligence in general from this book.
• Patterns In and Across Aggregated Data — Is "Anonymous" Collaborative Filtering Really
Safe?: Kent Anderson presents some of the dangers and privacy risks of collaborative
filtering.
• How Target Figured Out a Teen Girl Was Pregnant Before Her Father Did: Get a perspective
on the unsettling nature of recommendation systems and the potential risks behind item-
based collaborative filtering.
• The developerWorks Open source technical topic: Find extensive how-to information, tools,
and project updates to help you develop with open source technologies and use them with
IBM products.
Discuss
• Get involved in the developerWorks community. Connect with other developerWorks users
while exploring the developer-driven blogs, forums, groups, and wikis.