Hello, I'm doing a university project that must gather and combine data on a user-provided topic. The problem I've encountered is that Google search results for many terms are polluted with low-quality autogenerated pages, and if I use them, I can end up with wrong facts. How can I estimate the quality/trustworthiness of a page?
You may think "nah, Google engineers have been working on this problem for 10 years and he's asking for a solution", but if you think about it, a search engine must provide up-to-date content, and if it marks a good page as bad, its users will be dissatisfied. I don't have that constraint, so if my algorithm accidentally marks some good pages as bad, that wouldn't be a problem.
Here's an example. Say the input is "buy aspirin in south la"; try Googling it. The first three results have already been deleted from their sites, but the fourth one is interesting: radioteleginen.ning.com/profile/BuyASAAspirin (I don't want to make an active link).
Here's the first paragraph of the text:
The bare of purchasing prescription drugs from Canada is big in the U.S. at this moment. This is because in the U.S. prescription drug prices bang skyrocketed making it arduous for those who bang limited or concentrated incomes to buy their much needed medications. Americans pay more for their drugs than anyone in the class.
The rest of the text is similar, and then a list of related keywords follows. This is what I consider a low-quality page. While this particular text almost makes sense (apart from the mangled wording), the other examples I've seen (but can't find right now) are just rubbish, whose only purpose is to pull in some visitors from Google before getting banned a day after creation.
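To make "low quality" a bit more concrete, here is a minimal sketch of the kind of heuristic I'm imagining. The two signals (stopword ratio and how much a single keyword dominates the page), the thresholds, and the function name are all my own guesses for illustration, not anything a real search engine is documented to use:

```python
import re
from collections import Counter

# A tiny set of common English stopwords; a real version would use a fuller
# list (e.g. from NLTK). This is just enough for the sketch.
STOPWORDS = {
    "the", "a", "an", "and", "or", "of", "to", "in", "for", "is", "are",
    "it", "this", "that", "on", "with", "as", "at", "by", "from", "be",
}


def crude_quality_score(text: str) -> float:
    """Return a rough 0..1 score; lower means 'looks more like spam'.

    Both signals are assumptions, not a known algorithm:
      * stopword ratio -- spun or keyword-stuffed text tends to contain
        fewer ordinary function words than natural prose;
      * repetition -- if one content word dominates the page, it is
        probably a keyword-stuffed doorway page.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    if len(tokens) < 50:  # too little text to judge either way
        return 0.0

    stopword_ratio = sum(t in STOPWORDS for t in tokens) / len(tokens)

    content_words = [t for t in tokens if t not in STOPWORDS]
    most_common_share = (
        Counter(content_words).most_common(1)[0][1] / len(content_words)
        if content_words else 1.0
    )

    # Normal English prose has a stopword ratio of very roughly 0.3-0.5,
    # and no single content word should account for more than a few
    # percent of the text. The 0.35 and 0.10 cutoffs are arbitrary guesses.
    return min(stopword_ratio / 0.35, 1.0) * (
        1.0 - min(most_common_share / 0.10, 1.0)
    )


if __name__ == "__main__":
    spammy = "buy aspirin online buy aspirin cheap aspirin south la aspirin " * 20
    print(crude_quality_score(spammy))  # near 0 -> treat as low quality
```

Since false positives are cheap for me, I could simply drop every page below some score threshold, but I suspect there are better-known signals than these, which is what I'm really asking about.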