Netflix Prize: Critic Subsets

Submitted by taylor on July 11, 2009 - 1:10am

I've been doing research in a new and growing area known as Collaborative Filtering. The rise of web 2.0 applications has had the side-effect of creating large amounts of data about large amounts of users - something that before the internet just didn't have a means to happen. Using this data as input to a combination of statistics and artificial intelligence techniques yields the ability to extrapolate and form predictions. Predicting user satisfaction of a given product or a service translates into personalized recommendations for some products over others. This gives ecommerce an innovative ability to form personalized recommendations for anything ranging from ads, to dating, to movies.

The scale of some of the databases can pose serious problems for real-time applications. For example, in the Netflix Prize data set there are 17770 movies and 480,189 users. Comparing each item to every other item yields ~300 million unique comparisons, but for users this becomes unwieldy with over 119 billion unique comparisons.

One research angle we considered aimed to reduce the computational complexity of comparisons. Commonly used methods like the Euclidean distance or Pearson's correlation coefficient require O(n2) time. We aimed to produce a O(k2) by choosing a subset of users, or critics, from which to extrapolate from.

First of all, a critic is someone (or some set of ratings) that serves as a strong predictor for some large set of users. This relates to real-life experiences as with movie critics like Siskel and Ebert or some fashion critique for which I have no example. Using this idea, we could lump users together and find a critic, or set of critics, and use their ratings as predictors to estimate the satisfaction of any given user for a particular movie.

Selecting "good" critics can be done in many ways. The main objective is to balance good predictions (low error) for a critic with having a large set of ratings for the critic. Consequently, a simple way of choosing the "best" critic for some specific, individual user would be the critic who's Euclidean distance is minimized when plotting all users in a two dimensional space with the following axis: the size of the intersection of the user/critic pair and the RMSE (or other measure of agreement) over that intersection. Of course we'll need to normalize the ranges if we depending on how important each axis is. This isn't close to perfect, but it's a reasonable starting point.

Despite settling on a ranking algorithm for the objective quality of a critic, populating a set of critics is still a non-trivial task. In fact, to conclusively arrive at the best set we must have examined all permutations which is essentially the travelling salesman problem. This NP (non-deterministic in polynomial time) problem can currently only be computed in an exponential time which makes application intangible. Since 2000 Cambridge University has offered $1,000,000 to anyone who can either prove that P=NP or vice versa. To learn more about NP problems, check out Wikipedia's NP page.

Our solution to creating a good, yet diverse group of critics was done by first selecting the best critic. We then choose the subsequent critic to be the critic who balances being most different from the first critic and being the most accurate. In practice, we decided to measure critic similarity, or uniqueness, in a similar way we choose a critic to represent a user. We choose the critic who maximizes the mean RMSE for all critic's ratings who have already selected into our set. Now that we have each critics similarity to the existing critic set, we choose the critic who is most different, but has the most number of ratings. Again, the axes will need to be normalized ti give proper weighting to the difference of the critic and the quality of the critic.

As we continued to develop the possibilities and limitations of choosing critics, the idea of using Google's PageRank algorithm was brought up. The PageRank algorithm works by assigning a probability that each node in the given graph is reached. Some nodes, because of a large number of incoming links, become authorities and have more strongly weighted outgoing links. Simply by finding eigenvectors of the matrix representation of the graph are they able to find these probabilities and thus, the ranking.

To apply the PageRank algorithm to the Netflix prize for finding critics, we came up with a few ideas. First, each node was going to be a user. An edge (or link) in the graph represented an agreement between users. If two users agreed on multiple movies their would be multiple edges between them (or one, more strongly weighted edge). This left one problem: directionality.

In PageRank, me linking to IBM is a much different scenario then IBM linking to me. Because IBM is an "authority", their link to me counts way more than my link to them. Thus, the direction of the link is very important. So if two users agree, how do we determine the direction of the link? We decided that we would use time to determine directionality. That is, if I agreed with a user's rating for some movie and that user rated before me, then my link goes from me to him.

If we do a bidirectional link, then the algorithm becomes simplified and would amount to each node getting a weight equal to the percentage of agreements it had over the the number of all agreements.

We had a thick debate about whether this was useful or not. I argued that because Netflix wasn't around during the production of most movies on its site that the order in which it received votes couldn't matter. Further, because Netflix has a user-based rating system, no users have the opportunity to see the movies first and publish their opinions. Thus, the ratings in Netflix are much different in texture than real life ratings. Despite this arguement, we all agreed this was an interesting and possibly insightful method for determining directionality.

I've attached my highly specific pagerank C++ implementation.