
Technology Review

Wednesday, June 02, 2010
How iTunes Genius Really Works
An Apple engineer discloses how the company's premier recommendation engine parses millions of iTunes libraries.
By Christopher Mims

Ever since the feature debuted in 2008, there's been a lot of speculation about how iTunes Genius accomplishes its playlist-building magic. Now an engineer at Apple who works on the iTunes Genius team has revealed some tantalizing clues--a rare disclosure for the famously secretive company.
Recapitulating what Steve Jobs has said previously about iTunes Genius, Apple engineer Erik Goldman writes in his post on Quora that the starting point for the Genius service is a packet of usage data--what songs a user has in his or her library (and, presumably, how often he or she plays them)--sent from the iTunes application, which is then "folded into a larger database of users and songs."

Basically, your library of tracks is compared to every other Genius user's library of tracks. Apple then runs a set of previously secret algorithms to generate statistics for each song; Goldman describes these as straightforward recommendation algorithms, similar to those used by services like Netflix when it suggests movies for a user to watch or add to his queue. "These statistics are computed globally at regular intervals and stored in a cache," notes Goldman, because data on the similarity of any two songs changes slowly--it's assumed the only reason it changes at all is the shifting tastes of the listening public and the introduction of new tracks and artists.
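To make the batch-and-cache idea concrete, here is a minimal sketch in Python. Apple's actual statistics are secret; this stand-in uses a simple Jaccard-style co-occurrence score (of the users who have either song, what fraction have both?), and the library data and song names are invented for illustration.

```python
from itertools import combinations

# Hypothetical user libraries: user name -> set of songs owned.
libraries = {
    "user_a": {"song1", "song2", "song3"},
    "user_b": {"song1", "song2"},
    "user_c": {"song2", "song3"},
}

def compute_similarity_cache(libraries):
    # Jaccard-style co-occurrence for every pair of songs: of the users
    # who have either song, what fraction have both? (A stand-in for
    # whatever statistics Apple actually computes.)
    cache = {}
    all_songs = set().union(*libraries.values())
    for s1, s2 in combinations(sorted(all_songs), 2):
        both = sum(1 for lib in libraries.values() if s1 in lib and s2 in lib)
        either = sum(1 for lib in libraries.values() if s1 in lib or s2 in lib)
        cache[(s1, s2)] = both / either
    return cache

# Recomputed globally at regular intervals; lookups between runs hit the cache.
cache = compute_similarity_cache(libraries)
print(cache[("song1", "song2")])  # 2 of the 3 users with either song have both
```

The point of the cache is exactly what Goldman describes: pair similarities drift slowly, so it is cheap to serve stale values between periodic global recomputations.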
Goldman jokes that if he told you how Genius works, he'd have to kill you (or at the least, have a
squad of police officers raid your brain to retrieve Apple's rightful property), but he continues to
describe how the program works anyway.
To uncover part of how iTunes Genius works, says Goldman, "look at information retrieval algorithms, especially those that leverage the vector-space model." But before you can compare factors, such as the frequency of a particular artist or genre in a user's library or playlists, across iTunes libraries via a vector-space model, you need a clever weighting scheme that gives more weight to the things that really matter.
A simple way to properly weight factors for comparison is what's known as term frequency-inverse document frequency (tf-idf). It's simply a way to compare how often a particular factor occurs in a single document (or song, or library) to how often that factor occurs in a larger body, such as the sum of all iTunes libraries stored by the Genius servers. Thus, a factor that occurs fairly often in a given user's library--for example, an affinity for an obscure indie band--will tend to be a more powerful determinant, unless it also happens to occur quite often in the total set of data, as would be the case if the factor were an affinity for the Beatles.
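The indie-band-versus-Beatles intuition falls straight out of the tf-idf formula. A minimal sketch, with made-up libraries represented as lists of artist names (the artist names and data are purely illustrative):

```python
import math

# Hypothetical "documents": each user's library as a list of artist plays.
libraries = [
    ["obscure indie band", "the beatles", "obscure indie band"],
    ["the beatles", "pop star"],
    ["the beatles", "pop star", "pop star"],
]

def tf_idf(artist, library, all_libraries):
    # Term frequency: how prominent the artist is in this one library.
    tf = library.count(artist) / len(library)
    # Inverse document frequency: the rarer the artist is across all
    # libraries, the more weight it carries.
    n_containing = sum(1 for lib in all_libraries if artist in lib)
    idf = math.log(len(all_libraries) / n_containing)
    return tf * idf

# The obscure band is a strong signal; the Beatles appear in every
# library, so their idf is log(3/3) = 0 and they carry no weight at all.
print(tf_idf("obscure indie band", libraries[0], libraries))
print(tf_idf("the beatles", libraries[0], libraries))
```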
Once you've got your tf-idf weights sorted, you can represent them in a vector space model as
vectors.

In this example (courtesy of Wikipedia), two different documents (or songs) have all their various tf-idf weights represented as a single vector (e.g., d1), which can then be compared to a second document/vector (e.g., d2) and a query (q)--such as "which of these two songs is most like the one for which I've just clicked the Genius button?" Whichever document vector is closer in angle to the query vector is more similar.
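"Closer in angle" is usually measured with cosine similarity. A small sketch of the comparison the article describes, using hypothetical three-component tf-idf vectors (the numbers are invented):

```python
import math

def cosine_similarity(v1, v2):
    # Cosine of the angle between two vectors: 1.0 means they point the
    # same way, 0.0 means they are orthogonal (nothing in common).
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

# Hypothetical tf-idf vectors for two songs (d1, d2) and a query song (q).
d1 = [0.9, 0.1, 0.0]
d2 = [0.1, 0.8, 0.3]
q  = [0.8, 0.2, 0.1]

# d1 is closer in angle to the query, so it is the better recommendation.
best = max((d1, d2), key=lambda d: cosine_similarity(q, d))
print(best is d1)  # True
```

Using the angle rather than the raw distance means a user with a huge library and a user with a tiny one can still be judged similar if their tastes point the same way.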
Digging deeper into the iTunes Genius system, Goldman points to its use of latent-factor algorithms. "Latent-factor algorithms, in particular, tend to work very well on huge data sets with an enormous number of dimensions and a lot of noise," he says.
Latent factors are what shakes out when you do a particular kind of statistical analysis, called a
factor analysis, on a set of data, looking for the hidden, unseen variables that cause the variation in
all the different variables you're examining. Let's say that the variability in a dozen different
variables turns out to be caused by just four or five "hidden" variables--those are your latent
factors. They cause many other variables to move in more or less lock-step.
Discovering the hidden or "latent" factors in your data set is a handy way to reduce the size of the
problem that you have to compute, and it works because humans are predictable: people who like
Emo music are sad, and sad people also like the soundtracks to movie versions of vampire novels
that are about yearning, etc. You might think of it as the mathematical expression of a stereotype--
only it works.
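One common way to extract latent factors from a ratings matrix is a truncated singular value decomposition. A toy sketch, assuming a tiny invented user-by-song ratings matrix (Apple has not said which factorization Genius uses):

```python
import numpy as np

# Toy user-x-song ratings matrix: rows are users, columns are songs.
# By construction, the first two users share one taste and the last two
# share another, so two hidden factors explain nearly all the variation.
R = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

# Truncated SVD: keep only the k strongest latent factors.
k = 2
U, s, Vt = np.linalg.svd(R, full_matrices=False)
R_approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Two factors reconstruct the 4x4 matrix to within 0.5 per entry:
# the dozen visible ratings move in lock-step with two "taste" dimensions.
print(np.round(R_approx, 1))
```

The payoff is the dimensionality reduction the article describes: instead of comparing users song by song, you compare their handful of latent-factor coordinates.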
If you want to go really deep on this subject, Goldman suggests you read the papers that came out
of the million-dollar Netflix Prize, which was won by a combination of teams led by engineers from
AT&T. Their challenge was to improve Netflix's recommendation engine, and one of their primary
innovations was reducing the computational intensity of the algorithms used in recommendation
engines.
Previously, the amount of computation required to do a pairwise comparison of any two items in
Netflix's (and presumably Apple's) library scaled as a quadratic function of the number of
comparisons to be performed. But the AT&T team figured out how to re-write a fundamental
algorithm to make the problem scale only linearly with the amount of data involved. So, whatever
Apple's new data center is for, it's probably not to calculate Genius results.

Comments

Clarification
The article says "library scaled as a quadratic function of the number of comparisons". There are at most a quadratic number of pairwise comparisons you can do, so this must mean that the algorithm was comparing comparisons? Or should it say scaled quadratically with the amount of data?
-- snedunuri, 06/04/2010

Massachusetts Institute of Technology © 2010 Technology Review. All Rights Reserved.
