views:

87

answers:

3

Hi all, I'm curious how to normalize numbers for a ranking algorithm.

Let's say I want to rank a link based on importance and I have two columns to work with.

So a table would look like:

url | comments | views

Now I want to rank comments higher than views, so my first thought is to do comments*3 or something similar to weight it. However, if there is a large view count like 40,000 and only 4 comments, the comment weight gets drowned out.

So I'm thinking I have to normalize those scores down to a more equal playing field before I can weight them. Any ideas or pointers to how that's usually done?

thanks

+4  A: 

For each url, you could first normalize the comments and views to a common 0-to-1 scale (min-max scaling, labeled "percentile" in the formulas here). For example,

 comment_percentile = (comments - min(comments)) / (max(comments) - min(comments))
 views_percentile = (views - min(views)) / (max(views) - min(views))

Then you could assign weights to each of the percentile values to compute the overall score.

 url_score = (comment_percentile_weight * comment_percentile) + (views_percentile_weight * views_percentile)

Additional strategies may involve eliminating outliers if the values cluster toward one end of the range.
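
As a rough illustration of the two formulas above, here is a minimal Python sketch. The function names, the sample rows, and the 3:1 weights are assumptions made up for this example (the 3 comes from the "comments*3" idea in the question), not something prescribed by the answer.

```python
def min_max(values):
    """Rescale a list of numbers to the 0..1 range (min-max scaling)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Every value is the same, so there is no spread to scale by.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def url_scores(rows, comment_weight=3.0, views_weight=1.0):
    """rows is a list of (url, comments, views) tuples.

    The default weights are illustrative assumptions only.
    """
    comment_norm = min_max([r[1] for r in rows])
    views_norm = min_max([r[2] for r in rows])
    return {
        url: comment_weight * c + views_weight * v
        for (url, _, _), c, v in zip(rows, comment_norm, views_norm)
    }


# Hypothetical sample data: (url, comments, views)
rows = [
    ("/a", 4, 40000),
    ("/b", 20, 1000),
    ("/c", 0, 500),
]
scores = url_scores(rows)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

With these sample numbers the heavily commented "/b" ends up ahead of the heavily viewed "/a", which is the behaviour the question was asking for.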

btreat
I don't think that's how percentile works but I could be wrong
Joe Philllips
You are correct, d03boy! Thanks for the catch. Hopefully the updated post works better.
btreat
Along the same lines, you could normalize each column to be equal to the % of the maximum, or even normalize them so that all items in a column sum to 1 (that is, make each one the % of total sum).
Justin L.
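
The two alternatives in the comment above could look roughly like this in Python. The function names and sample counts are invented for illustration.

```python
def fraction_of_max(values):
    """Express each value as a fraction of the column maximum."""
    top = max(values)
    return [v / top if top else 0.0 for v in values]


def fraction_of_total(values):
    """Express each value as a fraction of the column sum, so the column sums to 1."""
    total = sum(values)
    return [v / total if total else 0.0 for v in values]


# Hypothetical comment counts for three urls
comments = [4, 20, 0]
print(fraction_of_max(comments))
print(fraction_of_total(comments))
```

Either result can then be weighted and summed the same way as the min-max values.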
A: 

Importance is really a way of telling the user how interested they might be in a forum topic or a blog post. In this case, you can't just multiply two numbers by different factors and add :)

What can you say about a blog post with 2000 views and only one comment? Well, perhaps it's a spam post, or it was viewed by web crawlers, or it's so boring that no one decided to comment on it.

In this case, we might want to look at the ratio of comments to views. The blog post in that example would have an "interest ratio" of 1/2000, while this post, which has 28 views and 1 comment right now, would get a score of 1/28.

The biggest ratio wins. By the way, if you are getting ratios over one... well, start looking for bugs :)
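
A small Python sketch of the ratio idea, using the two examples from this answer. The function name and the choice to return 0 for a page with no views are assumptions for the example.

```python
def interest_ratio(comments, views):
    """Comments per view; a page with no views gets 0 (an assumed convention)."""
    if views == 0:
        return 0.0
    return comments / views


# (label, comments, views) taken from the examples in the answer
posts = [
    ("blog post", 1, 2000),
    ("this post", 1, 28),
]
ranked = sorted(posts, key=lambda p: interest_ratio(p[1], p[2]), reverse=True)
print([label for label, _, _ in ranked])
```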

Olek Beluga
A: 

A similar problem was discussed a few weeks ago in this SO topic: "Algorithm to calculate a page importance based on its views / comments".

I'll give the same advice I offered there: use linear regression on a representative distribution of comment/view counts for web pages to work out a weighting function.
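
The answer does not say which quantity to regress on which, so the following is only one possible reading: fit a straight line predicting comments from views by ordinary least squares, then look at how far a page's real comment count sits above or below that line. The function names and the sample numbers are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of ys ~ slope * xs + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # assumes the xs are not all identical
    intercept = mean_y - slope * mean_x
    return slope, intercept


def comment_surplus(comments, views, slope, intercept):
    """How many more comments a page has than the fitted line predicts."""
    return comments - (slope * views + intercept)


# Hypothetical "representative" sample of pages
sample_views = [100, 200, 300, 400]
sample_comments = [2, 4, 6, 8]
slope, intercept = fit_line(sample_views, sample_comments)

# A page with 4 comments on 40,000 views falls far below the fitted line.
print(comment_surplus(4, 40000, slope, intercept))
```

A positive surplus would suggest a page draws more discussion than its traffic alone predicts. Whether that is the weighting function the answer has in mind is an assumption.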

Joel Hoff