Satnam Alag on why it’s difficult to improve the Netflix recommendation system, even with a million-dollar incentive:
The data set for the competition consists of more than 100 million anonymous movie ratings, using a scale of one to five stars, made by 480,000 users for 17,770 movies. Note that the user-item data set for this problem is sparsely populated, with nearly 99% of user-item entries being zero. The distribution of movies per user is skewed. The median number of ratings per user is 93. About 10% of users rated 16 or fewer movies, while 25% of users rated 36 or fewer. Two users rated as many as 17,000 movies. Similarly, the ratings per movie are also skewed: almost half the user base rated one popular movie (Miss Congeniality†); about 25% of movies had 190 or fewer ratings; and a handful of movies were rated fewer than 10 times.
So regarding users: there is rich data about a few atypical heavy raters, and sparse data about the majority. And regarding movies: there is copious information about popular titles, and little about the long tail.
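As a rough sanity check on the quoted figures, here is a minimal sketch of the arithmetic. It uses only the rounded numbers from the quote (100 million ratings, 480,000 users, 17,770 movies); since the quote says "more than 100 million", the results are approximate. The function name `empty_fraction` and the constants are illustrative and do not come from the original data set or any library.

```python
# Back-of-the-envelope check using the rounded figures from the quote.
# The constants are approximations ("more than 100 million" ratings).
NUM_RATINGS = 100_000_000
NUM_USERS = 480_000
NUM_MOVIES = 17_770


def empty_fraction(num_ratings: int, num_users: int, num_items: int) -> float:
    """Fraction of user-item cells that hold no rating."""
    total_cells = num_users * num_items
    return 1.0 - num_ratings / total_cells


total_cells = NUM_USERS * NUM_MOVIES
sparsity = empty_fraction(NUM_RATINGS, NUM_USERS, NUM_MOVIES)

# The mean is pulled up by the few very heavy raters, so it can sit
# well above the quoted median of 93 ratings per user.
mean_ratings_per_user = NUM_RATINGS / NUM_USERS

print(f"user-item cells:       {total_cells:,}")
print(f"empty fraction:        {sparsity:.1%}")
print(f"mean ratings per user: {mean_ratings_per_user:.0f}")
```

With these inputs the matrix has about 8.5 billion cells, of which roughly 98.8% are empty, in line with the "nearly 99%" in the quote. The mean of about 208 ratings per user is more than double the quoted median of 93, which is the skew described above: a small number of prolific raters account for a large share of the data.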