Skewed data

February 9, 2009 / A power law effect in Netflix user data makes it difficult to improve the recommendation system algorithms.

Satnam Alag on why it’s difficult to improve the Netflix recommendation system, even with a million-dollar incentive:

The data set for the competition consists of more than 100 million anonymous movie ratings, using a scale of one to five stars, made by 480,000 users for 17,770 movies. Note that the user-item data set for this problem is sparsely populated, with nearly 99% of user-item entries being zero. The distribution of movies per user is skewed. The median number of ratings per user is 93. About 10% of users rated 16 or fewer movies, while 25% of users rated 36 or fewer. Two users rated as many as 17,000 movies. Similarly, the ratings per movie are also skewed: almost half the user base rated one popular movie (Miss Congeniality); about 25% of movies had 190 or fewer ratings; and a handful of movies were rated fewer than 10 times.
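As a quick sanity check on the sparsity claim, the quoted figures can be plugged into a few lines of Python. This is only a back-of-the-envelope sketch: it assumes "more than 100 million" is rounded down to exactly 100 million, and the function and variable names are my own, not anything from the competition.

```python
# Figures from the quote; "more than 100 million" is rounded down (assumption).
n_ratings = 100_000_000
n_users = 480_000
n_movies = 17_770


def fill_fraction(n_ratings, n_users, n_movies):
    """Fraction of user-movie cells that actually hold a rating."""
    return n_ratings / (n_users * n_movies)


filled = fill_fraction(n_ratings, n_users, n_movies)
empty = 1 - filled
mean_per_user = n_ratings / n_users

print(f"filled: {filled:.2%}, empty: {empty:.2%}")
print(f"mean ratings per user: {mean_per_user:.0f}")
```

With those inputs roughly 1.2% of the cells are filled, which matches the "nearly 99%" empty figure. The mean works out to about 208 ratings per user, more than double the quoted median of 93, which is what you would expect when a few very heavy raters pull the average up.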

So regarding users: there is rich data about atypical users, and sparse data about the majority. And regarding movies: there is copious information about popular movies, and little about the long tail.
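To make the shape of that skew concrete, here is a small sketch of how the per-user and per-movie counts might be tallied from raw ratings. The ten ratings below are made-up toy data, not Netflix data, and `skew_summary` is a hypothetical helper of my own naming.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical toy ratings as (user_id, movie_id, stars); not real Netflix data.
ratings = [
    ("u1", "m1", 4), ("u1", "m2", 3), ("u1", "m3", 5),
    ("u1", "m4", 2), ("u1", "m5", 4), ("u1", "m6", 1),
    ("u2", "m1", 5),
    ("u3", "m1", 3), ("u3", "m2", 4),
    ("u4", "m1", 2),
]


def skew_summary(ratings):
    """Summarize how unevenly ratings are spread over users and movies."""
    per_user = Counter(user for user, _, _ in ratings)
    per_movie = Counter(movie for _, movie, _ in ratings)
    user_counts = sorted(per_user.values())
    return {
        "median_per_user": median(user_counts),
        "mean_per_user": mean(user_counts),
        "max_per_user": user_counts[-1],
        "top_movie": per_movie.most_common(1)[0],  # (movie_id, rating count)
    }


summary = skew_summary(ratings)
print(summary)
```

In the toy data one user supplies six of the ten ratings, so the mean per user (2.5) sits above the median (1.5), and a single movie (m1) collects ratings from every user while most movies get just one. The same pattern, at a vastly larger scale, is what the quoted statistics describe.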


4 responses

  1. Adrian Cooke

    I didn’t know about the article, thanks for the tip.

    February 9th, 2009 at 11:08 am

  2. lyds

    Did you read the original NYT Magazine article on this? It’s pretty fascinating and goes into a lot more detail: http://www.nytimes.com/2008/11/23/magazine/23Netflix-t.html

    February 9th, 2009 at 10:43 am

  3. Brian Christiansen

    Love the dagger, Adrian.

    February 9th, 2009 at 10:33 pm

  4. Adrian Cooke

    Thanks Brian, good to see you posting again.

    February 10th, 2009 at 12:52 am


Zero to One-Eighty contains writing on design, opinion, stories and technology.