In a previous post we tried to reduce the time complexity of computing correlation coefficients by choosing a clever set of "critics" to represent the views of all other users. This concept was ultimately unsuccessful for us, but it did yield some exciting new ideas.
We twisted this idea around: instead of reducing the complexity, we are trying to increase the accuracy of sample-based correlation coefficient estimates. When a statistical model is built on a small, incomplete sample, it can end up reflecting random error rather than a true, underlying trend. This is an intuitive concept, usually discussed as sampling error (or overfitting, when a model fits the noise), and most simply presented in the Law of Large Numbers.
In the Netflix Prize data set, there are some users, some movies, and some ratings for some user-movie pairs. If you imagine users listed along the X axis and movies along the Y axis, you have a matrix of all possible ratings for a user-movie pair. In the Netflix Prize data set, only about 1% of these entries are populated, so roughly 99% of the entries are missing. Any notion of similarity between two users is therefore drawn from an incomplete set of data, and often there is no data at all from which to form an estimate. Our idea is to use extra information in the data set to improve such sparse estimates.
In short, the idea is something like this:
Two users, A and B, have each rated several movies, but have no rated movies in common. Thus, standard statistical measures such as the Pearson correlation coefficient have no data on which to operate. We propose the use of a third user, C, who can stand in for A when comparing against B. For example, we find a C who is very similar to A, but who also shares some movies with B. We can then use C's similarity to B to represent A's similarity to B.
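To make the idea concrete, here is a minimal sketch in Python, not the code we are actually running. It assumes ratings are stored as a dictionary of dictionaries (user to movie to rating). The names `pearson`, `proxy_similarity`, `min_overlap`, and `min_proxy_sim` are hypothetical and chosen only for illustration, as is the rule of picking the single C most similar to A.

```python
from math import sqrt


def pearson(ratings_a, ratings_b, min_overlap=2):
    """Pearson correlation over co-rated movies, or None if it cannot be computed."""
    shared = sorted(set(ratings_a) & set(ratings_b))
    if len(shared) < min_overlap:
        return None  # too few movies in common to form an estimate
    xs = [ratings_a[m] for m in shared]
    ys = [ratings_b[m] for m in shared]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # undefined when one user's shared ratings are all equal
    return cov / sqrt(var_x * var_y)


def proxy_similarity(ratings, a, b, min_proxy_sim=0.8):
    """Estimate sim(a, b); fall back to a proxy user C when a and b do not overlap.

    Assumption for this sketch: use the single C most similar to a, and
    only accept C if sim(a, C) is at least min_proxy_sim.
    """
    direct = pearson(ratings[a], ratings[b])
    if direct is not None:
        return direct
    best = None  # (sim(a, C), sim(C, b)) for the best proxy found so far
    for c, ratings_c in ratings.items():
        if c in (a, b):
            continue
        sim_ac = pearson(ratings[a], ratings_c)
        sim_cb = pearson(ratings_c, ratings[b])
        if sim_ac is None or sim_cb is None or sim_ac < min_proxy_sim:
            continue
        if best is None or sim_ac > best[0]:
            best = (sim_ac, sim_cb)
    return None if best is None else best[1]


# Toy data: A and B share no movies, but C overlaps with both of them.
ratings = {
    "A": {"m1": 5, "m2": 1, "m3": 4},
    "B": {"m4": 2, "m5": 5, "m6": 1},
    "C": {"m1": 5, "m2": 2, "m3": 4, "m4": 1, "m5": 4, "m6": 1},
}

print(pearson(ratings["A"], ratings["B"]))  # None: no co-rated movies
print(proxy_similarity(ratings, "A", "B"))  # C's similarity to B, used in place of A's
```

Returning `None` rather than 0 when there is no usable proxy keeps "no evidence" distinct from "no correlation", which seems like the safer default for data this sparse.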
This idea of transitivity is both abstract and powerful, but it does have some obvious limitations. For example, it is well known that correlation is not transitive in general: A correlating with C, and C with B, does not guarantee that A correlates with B. Taken naively, then, our idea shouldn't work at all. In practice, this isn't necessarily the case if we impose some intelligent constraints on which Cs we are willing to use. For example, if you have a friend who often likes the same movies as you, you wouldn't hesitate to take his opinion on some new movies. You may disagree on one movie or another, but overall you can trust his opinion.
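One way to see why constraints on C can help is a standard bound on correlations. It assumes all three correlations are computed over the same set of observations, so that the 3x3 correlation matrix is positive semidefinite. That assumption does not strictly hold in our setting, where each pair of users is compared on a different set of co-rated movies and A and B share none, so treat this as intuition rather than a guarantee. Here $r_{AC}$, $r_{CB}$, and $r_{AB}$ denote the pairwise Pearson correlations:

$$
r_{AC}\,r_{CB} - \sqrt{(1 - r_{AC}^2)(1 - r_{CB}^2)} \;\le\; r_{AB} \;\le\; r_{AC}\,r_{CB} + \sqrt{(1 - r_{AC}^2)(1 - r_{CB}^2)}
$$

With $r_{AC} = r_{CB} = 0.9$ the lower bound is $0.81 - 0.19 = 0.62$, so A and B must be positively correlated. With $r_{AC} = r_{CB} = 0.6$ the lower bound is $0.36 - 0.64 = -0.28$, so nothing is guaranteed. In general the lower bound is positive exactly when both correlations are positive and $r_{AC}^2 + r_{CB}^2 > 1$, which is one argument for only using Cs that are strongly similar to A.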
I'll post back with the results.