Skewed data
Satnam Alag on why it’s difficult to improve the Netflix recommendation system, even when you have a million-dollar incentive:
The data set for the competition consists of more than 100 million anonymous movie ratings, using a scale of one to five stars, made by 480,000 users for 17,770 movies. Note that the user-item data set for this problem is sparsely populated, with nearly 99% of user-item entries being zero. The distribution of movies per user is skewed. The median number of ratings per user is 93. About 10% of users rated 16 or fewer movies, while 25% of users rated 36 or fewer. Two users rated as many as 17,000 movies. Similarly, the ratings per movie are also skewed: almost half the user base rated one popular movie (Miss Congeniality†); about 25% of movies had 190 or fewer ratings; and a handful of movies were rated fewer than 10 times.
So regarding users: there is rich data about atypical users, and sparse data about the majority. And regarding movies: there is copious information about popular movies, and little about the long tail.
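The sparsity and skew figures in the quote are easy to sanity-check with a few lines of Python. This is only a back-of-the-envelope sketch: the inputs are the rounded numbers quoted above ("more than 100 million" is treated as exactly 100 million), and the variable names are mine, not anything from the competition data set.

```python
# Rounded figures from the quote above; the true rating count is
# "more than 100 million", so these results are approximate.
ratings = 100_000_000
users = 480_000
movies = 17_770

cells = users * movies           # every possible user-movie pair
density = ratings / cells        # fraction of pairs that carry a rating
sparsity = 1 - density           # fraction of pairs that are empty
mean_per_user = ratings / users  # average number of ratings per user

print(f"possible user-movie pairs: {cells:,}")
print(f"filled: {density:.2%}, empty: {sparsity:.2%}")
print(f"mean ratings per user: {mean_per_user:.0f} (quoted median: 93)")
```

With these inputs, roughly 1.2% of the matrix is filled, which matches the "nearly 99%" empty claim. The mean comes out at about 208 ratings per user, more than double the quoted median of 93, which is what you would expect when a small number of very heavy raters pull the average up.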
† WTF?
I didn’t know about the article, thanks for the tip.
February 9th, 2009 at 11:08 am
Did you read the original NYT Magazine article on this? It’s pretty fascinating and goes into a lot more detail: http://www.nytimes.com/2008/11/23/magazine/23Netflix-t.html
February 9th, 2009 at 10:43 am
Love the dagger, Adrian.
February 9th, 2009 at 10:33 pm
Thanks Brian, good to see you posting again.
February 10th, 2009 at 12:52 am