I need an algorithm that allows me to determine an appropriate <priority> field for my website's sitemap, based on each page's view and comment counts.

For those of you unfamiliar with sitemaps, the priority field is used to signal the importance of a page relative to the others on the same website. It must be a decimal number between 0 and 1.

The algorithm will accept two parameters, viewCount and commentCount, and will return the priority value. For example:

GetPriority(100000, 100000); // Damn, a lot of views/comments! The returned value will be very close to 1, for example 0.995
GetPriority(3, 2); // Ok not many users are interested in this page, so for example it will return 0.082
+2  A: 

Priority = W1 * views / maxViewsOfAllArticles + W2 * comments / maxCommentsOfAllArticles with W1+W2=1

Although IMHO, just use 0.5*log_10(10+views)/log_10(10+maxViews) + 0.5*log_10(10+comments)/log_10(10+maxComments)
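In SQL terms, a minimal sketch of that second formula could look like the following, assuming a hypothetical Pages table with views and comments columns, and a LOG10 function as in MySQL/SQL Server (other databases spell it differently):

SELECT page_id,
       0.5 * LOG10(10 + views)    / LOG10(10 + (SELECT MAX(views)    FROM Pages))
     + 0.5 * LOG10(10 + comments) / LOG10(10 + (SELECT MAX(comments) FROM Pages))
       AS priority
FROM Pages;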

Full Decent
What's the use of `log_10`? Why are you adding 10 to the views/comments? Can you explain your algorithm?
stacker
@stacker The log10 is to compress a wide range into a smaller range. E.g. a page with the maximum number of views ranks 1, while a page with 10x fewer views ranks much less than 10x lower. This is often useful when the number of views spans a large range, from a handful to hundreds of thousands. The +10 is to keep log(0) out of the computation - log(0) is not defined - some say it's negative infinity - but either way, log(0) is not useful here, so the formula adds a constant to avoid it.
mdma
@mdma So can I use 1 instead of 10? Does 10 make more sense, to avoid log(1)?
stacker
@stacker It's also to avoid a divide by zero. log(1) is 0, so with +1 you'd get a divide-by-zero error whenever maxViews is 0. Stick with 10; it's the safest bet, and it makes very little difference to the final result.
mdma
@mdma Right, log(1) is 0, so technically I could use +2, but +10 reads more nicely...
stacker
Using the log function, the smallest priority I get is 0.4, compared to 0.0 without the log function.
stacker
@Stacker - This will change as your maxViews/maxComments increases - the formula is dynamic.
mdma
It's a nice algorithm, but it can still be better.
stacker
@Stacker - see my answer. If this doesn't meet your needs, please update your question with additional details.
mdma
+8  A: 

You mentioned doing this in an SQL query, so I'll give samples in that.

If you have a table/view Pages, something like this

Pages
-----
page_id:int
views:int  - indexed
comments:int - indexed

Then you can order them by writing

SELECT * FROM Pages
ORDER BY
    -- weights 0.3/0.7 deliberately favor comments over views (see below)
    0.3 * LOG10(10 + views)    / LOG10(10 + (SELECT MAX(views)    FROM Pages)) +
    0.7 * LOG10(10 + comments) / LOG10(10 + (SELECT MAX(comments) FROM Pages))

I've deliberately chosen unequal weighting between views and comments. A problem that can arise with equal weighting is that the ranking becomes a self-fulfilling prophecy: a page is returned at the top of the list, so it's visited more often, and thus gets more points, so it's shown at the top of the list, and it's visited more often, and it gets more points... Putting more weight on the comments reflects that these take real effort and show real interest.

The above formula will give you ranking based on all-time statistics. So an article that amassed the same number of views/comments in the last week as another article amassed in the last year will be given the same priority. It may make sense to repeat the formula, each time specifying a range of dates, and favoring pages with higher activity, e.g.

  0.3*(score for views/comments today) - live data
  0.3*(score for views/comments in the last week)
  0.25*(score for views/comments in the last month)
  0.15*(score for all views/comments, all time)

This will ensure that "hot" pages are given higher priority than similarly scored pages that haven't seen much action lately. All values apart from today's scores can be persisted in tables by scheduled stored procedures, so that the database doesn't have to re-aggregate masses of comment/view stats on every request; only today's stats are computed "live". Taking it one step further, the ranking formula itself can be computed and stored for historical data by a stored procedure run daily.
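A sketch of what that blended ranking could look like, assuming a hypothetical PageStats table (the name and columns here are invented for illustration) that scheduled jobs keep filled with per-period scores:

-- PageStats(page_id, score_today, score_week, score_month, score_all_time),
-- where each score_* column holds the log formula above evaluated over
-- that period's view/comment counts; only score_today is computed live.
SELECT page_id,
       0.30 * score_today
     + 0.30 * score_week
     + 0.25 * score_month
     + 0.15 * score_all_time AS priority
FROM PageStats
ORDER BY priority DESC;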

EDIT: To get a strict range from 0.1 to 1.0, you would modify the formula like this. But I stress - this will only add overhead and is unnecessary - the absolute values of priority are not important, only their values relative to the other URLs. The search engine uses these to answer the question: is URL A more important/relevant than URL B? It does this by comparing their priorities - which one is greater - not their absolute values.

// unnormalized priority; x is some page id (the original formula, now in pseudocode)
un(x) = 0.3*log(views(x)+10)/log(10+maxViews()) + 0.7*log(comments(x)+10)/log(10+maxComments())

The maximum will be 1.0; the minimum will start at 1.0 and move downwards as more views/comments are made.

We define un(0) as the minimum value, i.e. the value of the above formula when views(x) and comments(x) are both 0.

To get a normalized formula from 0.1 to 1.0, you then compute n(x), the normalized priority for page x

                  (1.0-un(x)) * (un(0)-0.1)
  n(x) = un(x) -  -------------------------    when un(0) != 1.0
                          1.0-un(0)

       = 0.1 otherwise.
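A sketch of the normalized version in SQL, using the same Pages table (CTE syntax as in Postgres/SQL Server; a real implementation would likely precompute un0 rather than recompute it per query):

WITH scored AS (
  SELECT page_id,
         0.3 * LOG10(10 + views)    / LOG10(10 + (SELECT MAX(views)    FROM Pages))
       + 0.7 * LOG10(10 + comments) / LOG10(10 + (SELECT MAX(comments) FROM Pages)) AS un
  FROM Pages
),
floor_val AS (
  -- un(0): the formula with views = comments = 0; note LOG10(10) = 1
  SELECT 0.3 / LOG10(10 + MAX(views)) + 0.7 / LOG10(10 + MAX(comments)) AS un0
  FROM Pages
)
SELECT s.page_id,
       CASE WHEN f.un0 = 1.0 THEN 0.1
            ELSE s.un - (1.0 - s.un) * (f.un0 - 0.1) / (1.0 - f.un0)
       END AS priority
FROM scored s CROSS JOIN floor_val f;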
mdma
Thanks for the answer. I'm more concerned about the algorithm than the implementation. This algorithm looks good, but it's not perfect: right now my lowest article, with few views and no comments, gets a priority of 0.4 rather than 0.1. But as you said, this is dynamic. The hot-pages formula is really interesting, but I think that's another topic, since this post is about the sitemap.xml file. Then again, maybe you're right and it's worth adding this logic to the sitemap.xml. I'm not sure.
stacker
@Stacker - Is the absolute value of the ranking important? I thought that only the relative ordering is important - the absolute values serve only to define the ordering. Is there something I am missing?
mdma
Yes, I'm looking for the absolute value. The ordering isn't by the priority field but by the LastModified field. Please take a look at Stack Overflow's sitemap.xml file to get a better understanding of this file. Because access to this page is restricted by an IP whitelist, you can see this file only through Google.
stacker
You can see the cached file on google: http://www.google.com/search?q=sitemaps.xml+site:stackoverflow.com+filetype:xml
stacker
It is possible to get the range from 0.1 to 1.0, but this will involve additional computation. And it really is not necessary. Take a look at the wikipedia article - it states that the priority value is to show how important the URL is **relative to other URLs on the site**. See http://en.wikipedia.org/wiki/Sitemaps. Forcing any kind of range for priority can only be for aesthetics, and doesn't alter how the search engine prioritizes pages.
mdma
Right, so as I said, this algorithm isn't perfect, but it's good enough. If you have another algorithm or resource, that would be great. BTW, I don't understand why the log10 function is applied to both the numerator and the denominator rather than to the whole fraction.
stacker
How is making the range explicitly 0..1 any more perfect? It will give the same results regardless of the range. You could have all your priorities in the range 0.99-1.00; as long as they are relatively ordered, that is all that matters. Having a lowest value of 0.1 is not a requirement of the sitemap.
mdma
Why divide by `MAX(views)` and not by `AVG(views)`? Take SO for example: they may also rank by votes, and the average value makes more sense there than the max.
stacker
The log is there so that if you have a page with a million views and a page with 1000 views and one with 1 view, you don't get priorities like 1, 0.0001, and 0.0000001. It will compress the dynamic range, to something like 0.9,0.5,0.1. The ordering will still be the same, it just requires fewer digits of precision.
mdma
Using the average won't make any difference to the relative ordering; you're only changing the baseline for comparison. My original formula is "how far away from the max score is this page", and you are suggesting "how far away from the average score is this page". All pages will still be ordered the same.
mdma
Please just try the formula in practice; SO isn't the place to learn math. If you then see that two pages are not getting the relative priorities you would like, we can look into that, as there may be details you've not mentioned. Otherwise, I think this will do the job you want.
mdma
Did you try the formula for real? Since you requested it, I've updated my answer with a formula that will ensure the range is strictly 0.1 to 1.0. But bear in mind that this is not necessary and will make no difference to the relative priority that the search engine uses to prioritize URLs from your site.
mdma
I did try it for real. I'm using @Full Decent's formula to get the relative order; my formula is: 0.2 * log10(10+views) / log10(10+maxViews) + 0.8 * log10(10+comments) / log10(10+maxComments). For these values: views = 100, comments = 0, the result is 0.4. This doesn't make sense to me. I think it has to do with the +10 value and using the max value instead of the average.
stacker
As you said, relative order is good enough for the sitemap, but I was looking for the perfect value, one that really shows the worth of an article. Anyway, you helped me with this problem, so I'm giving the bounty to you.
stacker
BTW, can you elaborate on the hot-topics solution you mentioned? How should it be implemented? A bunch of queries with UNION?
stacker
The hot-topic formula will be a series of formulas in the ORDER BY clause. Imagine you have a table where each page is a row, with columns for the views/comments organized by date into the 4 timescales (4 columns for views, 4 for comments). Your ORDER BY clause is then computed from these columns by applying the formula once for each timescale.
mdma
A: 

The most naive approach would be the following:

Let v[i] be the views of page i and c[i] the number of comments for page i; then define the relative view weight for page i to be

r_v(i) = v[i]/(sum_j v[j])

where sum_j v[j] is the total of the v[.] over all pages. Similarly define the relative comment weight for page i to be

r_c(i) = c[i]/(sum_j c[j]).

Now you want some constant parameter p: 0 < p < 1 which indicates the importance of views over comments: p = 0 means only comments are significant, p = 1 means only views are significant, and p = 0.5 gives equal weight.

Then set the priority to be

p*r_v(i) + (1-p)*r_c(i)

This might be over-simplistic, but it's probably the best starting point.
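As a sketch in SQL, assuming the Pages table from the earlier answer and p = 0.5 (the NULLIF guards are my addition, to avoid division by zero on a site with no views or comments at all):

SELECT page_id,
       0.5 * views    / NULLIF((SELECT SUM(views)    FROM Pages), 0)
     + 0.5 * comments / NULLIF((SELECT SUM(comments) FROM Pages), 0) AS priority
FROM Pages;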

Il-Bhima
I'm looking for a more sophisticated algorithm, one that will give an absolute value.
stacker
What do you mean by absolute value? Could you explain a bit more about this?
Il-Bhima
+3  A: 

What you're looking for here is not an algorithm, but a formula.

Unfortunately, you haven't really specified the details of what you want, so there's no way we can provide the formula to you.

Instead, let's try to walk through the problem together.

You've got two incoming parameters, the viewCount and the commentCount. You want to return a single number, Priority. So far, so good.

You say that Priority should range between 0 and 1, but this isn't really important. If we came up with a formula we liked but it produced values between 0 and N, we could just divide the results by N, so this constraint isn't really relevant.

Now, the first thing we need to decide is the relative weight of Comments vs Views.

If page A has 100 comments and 10 views, and page B has 10 comments and 100 views, which should have a higher priority? Or, should it be the same priority? You need to decide what's right for your definition of Priority.

If you decide, for example, that comments are 5 times more valuable than views, then we can begin with a formula like

 Priority = 5 * Comments + Views

Obviously, this can be generalized to

Priority = A * Comments + B * Views

Where A and B are relative weights.

But, sometimes we want our weights to be exponential instead of linear, like

 Priority = Comment ^ A + Views ^ B

which will give a very different curve than the earlier formula.

Similarly,

 Priority = Comment ^ A * Views ^ B

will give a higher value to a page with 20 comments and 20 views than to one with 1 comment and 40 views, if the weights are equal.

So, to summarize:

You really ought to make a spreadsheet with sample values for Views and Comments, and then play around with various formulas until you get one that has the distribution that you are hoping for.

We can't do it for you, because we don't know how you want to value things.
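For instance, a throwaway query like the following (VALUES row-constructor syntax as in Postgres/SQL Server; the sample counts and exponents are made up) lets you compare how candidate formulas rank the same pages:

SELECT views, comments,
       5 * comments + views                      AS linear_score,   -- A=5, B=1
       POWER(comments, 1.5) + POWER(views, 1.0)  AS exp_score,      -- A=1.5, B=1
       POWER(comments, 1.0) * POWER(views, 1.0)  AS product_score   -- A=B=1
FROM (VALUES (10, 100), (100, 10), (20, 20), (40, 1)) AS t(views, comments)
ORDER BY product_score DESC;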

Michael Dorfman
+1 Clear and concise.
Caleb Thompson
+1  A: 

What several posters have essentially advocated without conceptual clarification is that you use linear regression to determine a weighting function of webpage view and comment counts to establish priority.

This technique is pretty easy to implement for your problem, and the basic concept is described well in this Wikipedia article on linear regression models.

A quick summary of how to apply it to your problem is:

  1. Determine the parameters of the line which best fits the view and comment count data for all your site's webpages, i.e., use linear regression.
  2. Use the line parameters to derive your priority function for the view/count parameters.

Code examples for basic linear regression should not be hard to track down if you don't want to implement it from scratch from basic math formulas (use the web, Numerical Recipes, etc.). Also, any general math software package like Matlab, R, etc., comes with linear regression functions.
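As a minimal illustration of step 1 (my own sketch, not code from this answer), the slope of the least-squares line for comments as a function of views can be computed directly from SQL aggregates over the Pages table used earlier; the intercept is then AVG(comments) - slope * AVG(views):

-- Closed-form slope for simple linear regression of comments on views.
-- The 1.0 * factors force decimal arithmetic and guard against integer overflow.
SELECT (COUNT(*) * SUM(1.0 * views * comments) - SUM(views) * SUM(1.0 * comments))
     / NULLIF(COUNT(*) * SUM(1.0 * views * views) - SUM(views) * SUM(1.0 * views), 0)
       AS slope
FROM Pages;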

Joel Hoff