Given a list of (say) songs, what's the best way to determine their relative "popularity"?

My first thought is to use Google Trends. This list of songs:

  1. Subterranean Homesick Blues
  2. Empire State of Mind
  3. California Gurls

produces the following Google Trends report (to find out what's popular now, I restricted the report to the last 30 days):

Empire State of Mind is marginally more popular than California Gurls, and Subterranean Homesick Blues is far less popular than either.

So this works pretty well, but what happens when your list is 100 or 1000 songs long? Google Trends only allows you to compare 5 terms at once, so absent a huge round-robin, what's the right approach?

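Another option is to just do a Google Search for each song and see which has the most results, but this doesn't really measure the same thing: a result count tells you how many pages mention the title, not how much interest there is in it right now.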

+2  A: 

You could search for the item on Twitter and see how many times it is mentioned. Or look it up on Amazon to see how many people have reviewed it and what rating they gave it. Both Twitter and Amazon have APIs.

Simon Brown
+2  A: 

There is an unofficial Google Trends API. See http://zoastertech.com/projects/googletrends/index.php?page=Getting+Started. I have not used it, but perhaps it is of some help.

Anon
+1  A: 

I would certainly treat Google's API as "restricted".

In general, comparison functions used for sorting algorithms are very "binary":

  • input: 2 elements
  • output: true/false

Here you have:

  • input: 5 elements
  • output: relative weights of each element

Therefore you will only need a linear number of calls to the API (whereas sorting usually requires O(N log N) calls to comparison functions).

You will need exactly ceil( (N-1)/4 ) calls: the first call rates 5 elements, and every later call reuses one already-rated element and rates 4 new ones. You can parallelize these calls, though do read the user guide closely to find out how many requests you are authorized to submit.

Then, once all of them are "rated", you can do a simple sort locally.

Intuitively, in order to gather them properly you would:

  • Shuffle your list
  • Pop the first 5 elements
  • Call the API
  • Insert them, sorted, into the result (use insertion sort here)
  • Pick the median of the result
  • Pop the next 4 elements (or fewer if fewer are available)
  • Call the API with the median and those 4 elements
  • Go back to the insert step until you run out of elements
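
The steps above can be sketched in Python roughly as follows. This is only a sketch under stated assumptions: `compare` is a hypothetical callable standing in for whatever Trends access you have, taking at most 5 terms and returning one relative weight per term. Rescaling each batch by the shared median is also an assumption here, made so that weights from different calls end up on one common scale.

```python
import random
from typing import Callable, Dict, List, Optional, Sequence, Tuple

# Hypothetical signature: up to 5 terms in, one relative weight per term out.
Compare = Callable[[Sequence[str]], Dict[str, float]]


def rank_by_batches(
    items: Sequence[str],
    compare: Compare,
    batch_size: int = 5,
    seed: Optional[int] = None,
) -> Tuple[List[str], int]:
    """Return (items sorted from most to least popular, number of calls made)."""
    pending = list(items)
    random.Random(seed).shuffle(pending)

    scores: Dict[str, float] = {}
    calls = 0

    # The first call rates a full batch and fixes the common scale.
    first, pending = pending[:batch_size], pending[batch_size:]
    if first:
        scores.update(compare(first))
        calls += 1

    while pending:
        # Reuse the median of everything rated so far as the shared reference.
        ranked = sorted(scores, key=lambda name: scores[name])
        pivot = ranked[len(ranked) // 2]

        batch = pending[: batch_size - 1]
        pending = pending[batch_size - 1 :]
        result = compare([pivot] + batch)
        calls += 1

        if result[pivot] == 0:
            # Assumption: a zero weight carries no scale information.
            raise ValueError(f"pivot {pivot!r} got weight 0; cannot rescale")

        # Bring this batch onto the common scale via the pivot's two weights.
        factor = scores[pivot] / result[pivot]
        for name in batch:
            scores[name] = result[name] * factor

    ordered = sorted(scores, key=lambda name: scores[name], reverse=True)
    return ordered, calls
```

The final sort happens locally on the rescaled weights, so the number of API calls stays at ceil( (N-1)/4 ) for N >= 5. If the API rounds its weights coarsely, rescaling errors can pile up over many batches, so treat distant comparisons with some caution.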

If your list is 1000 songs long, that's 250 calls to the API, nothing too scary.

Matthieu M.
+2  A: 

Excellent question. One song by Britney Spears might be phenomenally popular for 2 months and then (thankfully) forgotten, while another song by Elvis might have sustained popularity for 30 years. How do you quantitatively distinguish the two? We would like sustained popularity to count for more than a "flash in the pan", but how do we get this result?

First, I would normalize around the release date - Subterranean Homesick Blues might be unpopular now (not in my house, though), but normalizing back to 1965 might yield a different result.

Since most songs climb in popularity, level off, then decline, let's choose the period when they level off. One might assume that, during that period, the two series are stationary, uncorrelated, and normally distributed. Now you can apply a test (a two-sample t-test, for example) to determine whether the means are different.
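
As one possible concrete choice, here is a minimal sketch of Welch's two-sample t-test using only the Python standard library. The choice of Welch's variant is an assumption (the answer does not name a specific test), and the p-value uses a normal approximation that is only reasonable for fairly long plateau series. `a` and `b` are hypothetical lists of plateau-period popularity readings, one per song.

```python
from math import sqrt
from statistics import NormalDist, mean, variance
from typing import Sequence, Tuple


def welch_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Return (t statistic, Welch-Satterthwaite degrees of freedom).

    Needs at least two readings per series and some variation in the data;
    two constant series would cause a division by zero.
    """
    na, nb = len(a), len(b)
    # Squared standard error of each sample mean.
    se_a = variance(a) / na
    se_b = variance(b) / nb
    t = (mean(a) - mean(b)) / sqrt(se_a + se_b)
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (na - 1) + se_b ** 2 / (nb - 1))
    return t, df


def approx_p_value(t: float) -> float:
    """Two-sided p-value from a normal approximation to the t distribution."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(t)))
```

A small p-value suggests the plateau means really differ. For short series, compare `t` against a t distribution with `df` degrees of freedom instead of relying on the normal approximation.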

There are probably less restrictive tests for the magnitude of the difference between two time series, but I haven't run across them yet.

Anyone?

Grembo