Netflix Prize

Eoin Lawless

Eoin@maths.tcd.ie

I remember reading about the Netflix Prize three years ago when it was first announced. Netflix offered a $1 million prize to anyone who could improve their movie rating system by 10% or more. They provided a dataset of over 100 million ratings by 500 thousand viewers of 17 thousand movies. The task is to build a profile of each user's tastes and predict their ratings for a selection of movies. The accuracy of a prediction is calculated by Netflix, and each contestant's score is posted on a leaderboard as the root mean square error (RMSE) from the actual ratings.
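For concreteness, the RMSE is just the square root of the average squared difference between predicted and actual ratings. A minimal sketch in Python (the function and argument names are my own placeholders, not anything from the contest tools):

import math

def rmse(predictions, actuals):
    # Root mean square error between predicted and actual ratings.
    n = len(actuals)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predictions, actuals)) / n)

# For example, rmse([3.8, 2.1], [4, 2]) is about 0.158.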

Three years later and the contest is still ongoing. As I've been doing a lot of reading in statistics recently, and the best way to learn is by doing, I decided to enter. Solutions to the problem generally blend several quite different methods.

In mid-June I signed up for the contest and downloaded the datasets. Then on June 26 it was announced that one team, BellKor's Pragmatic Chaos, had reached the 10% barrier, and after one further month to allow teams to make their last submissions the competition would end! Possibly a good thing for me, as I've discovered that trying to improve your solution can be somewhat addictive.

After two weeks I am at position 678 of about 5000 teams, with an RMSE of 0.9076. My solution is based on SVD à la Simon Funk. I added two twists to SVD to improve upon its initial RMSE of 0.91453. The first adds date information and the second uses ranks rather than raw scores.

In the simplest form of Simon Funk's SVD, each user and each movie has a vector of features, indexed by k. The predicted score p for user u and movie m is

p[u][m] = sum over k of  movie_feature[k][m] * user_feature[k][u]
The movie and user features are first initialised to
sqrt( global average / number of features)
This gives an RMSE of about 1.04. The features are then trained using gradient descent over the list of known ratings r, one feature at a time.
error = r[u][m] - p[u][m]
movie_feature[k][m] = movie_feature[k][m] + learning_rate * error * user_feature[k][u]
user_feature[k][u] = user_feature[k][u] + learning_rate * error * movie_feature[k][m]
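Putting these pieces together, here is a minimal Python sketch of the prediction and per-feature training loop. The dataset sizes, learning rate, and epoch count are illustrative placeholders rather than the exact values I used, and ratings is assumed to be a list of (user, movie, rating) triples:

import math

n_users, n_movies, n_features = 480000, 17770, 40   # illustrative sizes
global_average = 3.6                                 # roughly the mean rating
learning_rate = 0.001                                # placeholder value
n_epochs = 100

init = math.sqrt(global_average / n_features)
user_feature = [[init] * n_users for _ in range(n_features)]
movie_feature = [[init] * n_movies for _ in range(n_features)]

def predict(u, m):
    # Dot product of the user's and movie's feature vectors.
    return sum(user_feature[k][u] * movie_feature[k][m] for k in range(n_features))

def train(ratings):
    for k in range(n_features):              # one feature at a time
        for _ in range(n_epochs):            # many passes over the known ratings
            for u, m, r in ratings:
                error = r - predict(u, m)
                movie_feature[k][m] += learning_rate * error * user_feature[k][u]
                user_feature[k][u] += learning_rate * error * movie_feature[k][m]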
This training step is done approximately 100 times over the entire set of known ratings, and repeated for each of the features. The first feature tends to be high for blockbuster-style movies, while later features tend to pick out other aspects such as romance, black and white classics, or comedies. (This is a very simplified description.)

My quick and easy way of adding date information is to add two further sets of features: a pairing of movie and date of rating, and a pairing of user and date of rating. The hope is that the first captures seasonal effects such as Christmas and Halloween, while the second captures a user's changing tastes over time. Date information has been added to SVD before, but not, I believe, in as simple a manner. The dates of the ratings are binned, so that ratings made in the first month in the dataset are in bin one, ratings made in the second month are in bin two, and so on. The prediction now has two extra terms:
p[u][m] = sum over k of  movie_feature[k][m] * user_feature[k][u]
+ sum over k of  moviedate_feature[k][m] * datemovie_feature[k][d]
+ sum over k of  userdate_feature[k][u] * dateuser_feature[k][d]
(Though I found that I needed only about 10 features for the date aspects, rather than about 200 for the movie/user features.) The learning rule for the new features is very similar; for the moviedate feature, for example:
error = r[u][m] - p[u][m][d]
moviedate_feature[k][m] = moviedate_feature[k][m] + learning_rate * error * datemovie_feature[k][d]
datemovie_feature[k][d] = datemovie_feature[k][d] + learning_rate * error * moviedate_feature[k][m]
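As a rough sketch of how the date bins and the extended prediction might look, reusing the names from the earlier sketch (the bin count, first month, and initialisation below are my own assumptions):

from datetime import date

n_date_features = 10
n_date_bins = 84                      # e.g. seven years of monthly bins (assumption)
first_month = date(1998, 1, 1)        # placeholder: the earliest rating month in the dataset

init_d = math.sqrt(global_average / n_date_features)
moviedate_feature = [[init_d] * n_movies for _ in range(n_date_features)]
datemovie_feature = [[init_d] * n_date_bins for _ in range(n_date_features)]
userdate_feature = [[init_d] * n_users for _ in range(n_date_features)]
dateuser_feature = [[init_d] * n_date_bins for _ in range(n_date_features)]

def date_bin(d):
    # Whole months since the first month in the dataset (bin 0 = first month).
    return (d.year - first_month.year) * 12 + (d.month - first_month.month)

def predict_with_dates(u, m, d):
    b = date_bin(d)
    return (predict(u, m)
            + sum(moviedate_feature[k][m] * datemovie_feature[k][b] for k in range(n_date_features))
            + sum(userdate_feature[k][u] * dateuser_feature[k][b] for k in range(n_date_features)))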
Finally, to prevent overtraining I used a regularisation term, as described in Simon Funk's blog.

My second idea was an attempt to tackle the different ways people rate movies. Imagine two people with identical tastes in movies. One might give all movies a rating between 2 and 5; the other gives exactly one less to each movie. Their preferences are the same, but their calibration on the allowed range of 1 to 5 is different. My idea was that, rather than training on the raw scores, the features are trained using ranks. This way the impact of different scoring patterns is reduced a little. If a user has rated n movies, and given c[i] of them a rating of i, then, for example, the normalised rank score of a movie given a real score of 1 is
5 * (c[1] + 1) / (2 * n)
and for a movie given a rating of i it is
5 * ( (c[i] + 1)/2 + sum over j < i of c[j] ) / n
The factor of five is merely present so that the ratings are roughly in the range 1 to 5.
The term (c[i] + 1)/2 selects the middle rank if a number of movies have been given the same rating.
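A small Python sketch of this rank normalisation; the function name is my own, c is indexed by rating value (1 to 5), and n is the user's total number of ratings:

def normalised_rank_scores(c, n):
    # c[i] is the number of movies the user has rated i; n is the sum of all c[i].
    scores = {}
    below = 0                                  # movies with a strictly lower rating
    for i in range(1, 6):
        # middle rank among the movies sharing rating i, scaled to roughly 1 to 5
        scores[i] = 5.0 * (below + (c[i] + 1) / 2.0) / n
        below += c[i]
    return scores

# Example: with c = {1: 2, 2: 3, 3: 5, 4: 6, 5: 4} and n = 20,
# a raw rating of 1 maps to 5 * (2 + 1) / (2 * 20) = 0.375.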

This method proved less useful than I had hoped. However, my submission to Netflix was actually a blend of the two methods (using the Gravity linear blending tool), so it still contributed. In fact one of the main findings to come out of the Netflix Prize is that a blend of multiple, preferably quite different techniques always beats any one technique in isolation.
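I don't know the internals of the Gravity blending tool, but linear blending amounts to fitting the blend weights by least squares on a held-out probe set. A minimal sketch of that idea using numpy (my own code, not the Gravity tool):

import numpy as np

def fit_blend_weights(probe_predictions, probe_actuals):
    # probe_predictions: one column per method, one row per probe rating.
    # Fit the weights (plus a bias term) that minimise squared error on the probe set.
    A = np.column_stack([probe_predictions, np.ones(len(probe_actuals))])
    weights, *_ = np.linalg.lstsq(A, probe_actuals, rcond=None)
    return weights

def blend(predictions, weights):
    A = np.column_stack([predictions, np.ones(len(predictions))])
    # Clip to the allowed rating range of 1 to 5.
    return np.clip(A @ weights, 1.0, 5.0)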

Finally, here are some of my results on the Probe10 dataset from the Gravity team.
Method                   RMSE on Probe10 dataset
SVD                      0.91453
SVD + date               0.908360
SVD + Normalized Ranks   0.920397
Blended Result           0.903485
The corresponding RMSE on the Quiz dataset, on submission to Netflix, was 0.9076.

I believe from reading the Netflix forums that, with careful choice of learning rates and by training all the features simultaneously, it is possible to achieve a Quiz RMSE of 0.898 using SVD alone. So there is certainly scope for further improvement of my date-incorporating method, by tuning the learning parameters or by moving from individual to simultaneous training of features.

It's a fascinating machine learning problem, with applications ranging from general shopping recommender systems to dating agencies to computer bug analysis (which is why I was reading up on stats initially).

Update - 27 July 2009 - It's all over!

It looks like the competition is all over. A super-team, consisting of most of the remaining top-ten teams, formed at the last minute. They called themselves The Ensemble and pipped BellKor's Pragmatic Chaos at the post, finishing with a 10.10% improvement over Cinematch, just 0.01 percentage points better. My own final place is a very humble 662. I just regret not entering earlier!