views: 167
answers: 5

Hello, I'm doing a university project that must gather and combine data on a user-provided topic. The problem I've encountered is that Google search results for many terms are polluted with low-quality autogenerated pages, and if I use them, I can end up with wrong facts. How can I estimate the quality/trustworthiness of a page?

You may think "nah, Google engineers have been working on this problem for 10 years and he's asking for a solution", but if you think about it, a search engine must provide up-to-date content, and if it marks a good page as a bad one, users will be dissatisfied. I don't have such limitations, so if the algorithm accidentally marks some good pages as bad, that wouldn't be a problem.

Here's an example: say the input is buy aspirin in south la. Try Googling it. The first 3 results have already been deleted from their sites, but the fourth one is interesting: radioteleginen.ning.com/profile/BuyASAAspirin (I don't want to make it an active link)

Here's the first paragraph of the text:

The bare of purchasing prescription drugs from Canada is big in the U.S. at this moment. This is because in the U.S. prescription drug prices bang skyrocketed making it arduous for those who bang limited or concentrated incomes to buy their much needed medications. Americans pay more for their drugs than anyone in the class.

The rest of the text is similar, and then a list of related keywords follows. This is what I consider a low-quality page. While this particular text seems to make sense (apart from being horribly written), the other examples I've seen (which I can't find now) are just rubbish whose only purpose is to pull some users in from Google and get banned a day after creation.

+3  A: 

Define 'quality' of a web page? What is the metric?

If someone were looking to buy fruit, then searching for 'big sweet melons' would give many results containing images of a 'non-textile' slant.

The markup and hosting of those pages may, however, be sound engineering ..

But the page of a dirt farmer presenting his high-quality, tasty and healthy produce might be visible only in IE4.5, since the HTML is 'broken' ...

lexu
In general, users make pages of good quality, spammers make pages of low quality. Please take a look at the example I've just added
roddik
@roddik: Please take a closer look at this site (all four sites in the trilogy, actually). Some of the questions here are hard to read, full of spelling and grammar errors (mine too!). Yet they are interesting/to the point. Other questions are well put, but utter junk. Linking language/grammar to quality is IMHO questionable, and borders on being elitist.
lexu
I think the question might be talking about pages with **autogenerated text**. It should be possible to detect many of those.
dmcer
+1  A: 

For each result set per keyword query, do a separate Google query to find the number of sites linking to each result; if no other site links to it, exclude it. I think this would be a good start, at least.
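
A minimal sketch of that filter in Python (the URLs and backlink counts below are made up, and how you actually obtain the counts, whether from your own crawl, a third-party link index, or search queries, is left open):

    def filter_unlinked(results, backlink_counts):
        """Keep only results that at least one other site links to.

        backlink_counts maps URL -> number of external inbound links;
        obtaining those counts is the hard part and is assumed here.
        """
        return [url for url in results if backlink_counts.get(url, 0) > 0]

    # Example with made-up counts:
    results = ["http://good-pharmacy-info.example/",
               "http://spammy-profile.example/BuyASAAspirin"]
    backlink_counts = {"http://good-pharmacy-info.example/": 12,
                       "http://spammy-profile.example/BuyASAAspirin": 0}
    print(filter_unlinked(results, backlink_counts))
    # -> ['http://good-pharmacy-info.example/']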

RandyMorris
link farm ... won't work without additional factors!
lexu
Perhaps I am naive, but it was stated to be a university project. Additionally, Google itself uses this factor in deciding relevancy.
RandyMorris
makes sense, link farms don't show up in Google backlinks
roddik
+5  A: 

N-gram Language Models

You could try training one n-gram language model on the autogenerated spam pages and one on a collection of other non-spam webpages.

You could then simply score new pages with both language models to see if the text looks more similar to the spam webpages or regular web content.
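
A minimal sketch of the idea with word-bigram models and add-one smoothing (the toy corpora stand in for real collections of spam and non-spam pages; a real toolkit, as suggested under Tools below, handles smoothing and scale far better):

    import math
    from collections import Counter

    def train_bigram_lm(pages):
        """Count unigrams and bigrams over tokenized pages."""
        unigrams, bigrams = Counter(), Counter()
        for tokens in pages:
            padded = ["<s>"] + tokens + ["</s>"]
            unigrams.update(padded)
            bigrams.update(zip(padded, padded[1:]))
        return unigrams, bigrams

    def log_prob(tokens, lm, vocab_size):
        """log P(Text | model) under a bigram model with add-one smoothing."""
        unigrams, bigrams = lm
        padded = ["<s>"] + tokens + ["</s>"]
        return sum(math.log((bigrams[(prev, word)] + 1.0) /
                            (unigrams[prev] + vocab_size))
                   for prev, word in zip(padded, padded[1:]))

    # Toy corpora; in practice, train on real spam and non-spam page text.
    spam_pages = [["buy", "cheap", "prescription", "drugs", "online"]]
    ham_pages = [["prescription", "drugs", "should", "be", "taken", "as", "directed"]]

    spam_lm = train_bigram_lm(spam_pages)
    ham_lm = train_bigram_lm(ham_pages)
    vocab = len({w for page in spam_pages + ham_pages for w in page}) + 2  # + <s>, </s>

    page = ["buy", "cheap", "drugs"]
    print(log_prob(page, spam_lm, vocab))  # estimate of log P(Text|Spam)
    print(log_prob(page, ham_lm, vocab))   # estimate of log P(Text|Non-Spam)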

Better Scoring through Bayes' Law

When you score a text with the spam language model, you get an estimate of the probability of finding that text on a spam web page, P(Text|Spam). The notation reads as the probability of Text given Spam (page). The score from the non-spam language model is an estimate of the probability of finding the text on a non-spam web page, P(Text|Non-Spam).

However, the term you probably really want is P(Spam|Text) or, equivalently P(Non-Spam|Text). That is, you want to know the probability that a page is Spam or Non-Spam given the text that appears on it.

To get either of these, you'll need to use Bayes' Law, which states

           P(B|A)P(A)
P(A|B) =  ------------
              P(B)

Using Bayes' Law, we have

P(Spam|Text)=P(Text|Spam)P(Spam)/P(Text)

and

P(Non-Spam|Text)=P(Text|Non-Spam)P(Non-Spam)/P(Text)

P(Spam) is your prior belief that a page selected at random from the web is a spam page. You can estimate this quantity by counting how many spam web pages there are in some sample, or you can even use it as a parameter that you manually tune to trade off precision and recall. For example, giving this parameter a high value will result in fewer spam pages being mistakenly classified as non-spam, while giving it a low value will result in fewer non-spam pages being accidentally classified as spam.

The term P(Text) is the overall probability of finding Text on any webpage. If we ignore that P(Text|Spam) and P(Text|Non-Spam) were determined using different models, this can be calculated as P(Text)=P(Text|Spam)P(Spam) + P(Text|Non-Spam)P(Non-Spam). This sums out the binary variable Spam/Non-Spam.

Classification Only

However, if you're not going to use the probabilities for anything else, you don't need to calculate P(Text). Rather, you can just compare the numerators P(Text|Spam)P(Spam) and P(Text|Non-Spam)P(Non-Spam). If the first one is bigger, the page is most likely a spam page, while if the second one is bigger, the page is most likely non-spam. This works because the equations above for both P(Spam|Text) and P(Non-Spam|Text) are normalized by the same P(Text) value.
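
As a sketch, continuing from the bigram models above, the comparison is easiest in log space, with the prior P(Spam) as the tunable knob described earlier (0.5 below is just an arbitrary starting point):

    import math

    def classify(log_p_text_given_spam, log_p_text_given_ham, p_spam=0.5):
        """Compare log[P(Text|Spam)P(Spam)] against log[P(Text|Non-Spam)P(Non-Spam)].

        Raising p_spam means fewer spam pages slip through as non-spam,
        at the cost of more non-spam pages being flagged as spam.
        """
        spam_score = log_p_text_given_spam + math.log(p_spam)
        ham_score = log_p_text_given_ham + math.log(1.0 - p_spam)
        return "spam" if spam_score > ham_score else "non-spam"

    # Using the log probabilities from the bigram-model sketch above:
    # print(classify(log_prob(page, spam_lm, vocab), log_prob(page, ham_lm, vocab)))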

Tools

In terms of software toolkits you could use for something like this, SRILM would be a good place to start and it's free for non-commercial use. If you want to use something commercially and you don't want to pay for a license, you could use IRST LM, which is distributed under the LGPL.

dmcer
+1  A: 

If you are looking for performance-related metrics, then Y!Slow [a plugin for Firefox] could be useful.

http://developer.yahoo.com/yslow/

daedlus
A: 

You can use a supervised learning model to do this type of classification. The general process goes as follows (a sketch of the pipeline follows the list):

  1. Get a sample set for training. It needs to provide examples of the kinds of documents you want to cover. The more general you want to be, the larger the example set you need. If you want to focus only on websites related to aspirin, that shrinks the necessary sample set.

  2. Extract features from the documents. These could be the words pulled from the website.

  3. Feed the features into a classifier, such as the ones provided in MALLET or WEKA.

  4. Evaluate the model using something like k-fold cross validation.

  5. Use the model to rate new websites.
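
MALLET and WEKA are Java toolkits; purely as an illustration of steps 2-4 (and not what this answer prescribes), here is a sketch of the same pipeline in Python with scikit-learn, using bag-of-words features, a Naive Bayes classifier, and k-fold cross-validation on a made-up sample set:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline
    from sklearn.model_selection import cross_val_score

    # Step 1 stand-in: a tiny labeled sample set (real page text in practice).
    texts = [
        "the bare of purchasing prescription drugs from canada is big",    # spam
        "buy cheap aspirin online no prescription best price",             # spam
        "aspirin is a common over-the-counter pain reliever",              # good
        "talk to your pharmacist before combining aspirin with ibuprofen", # good
    ]
    labels = [1, 1, 0, 0]  # 1 = low quality / spam, 0 = good

    # Steps 2 and 3: word features feeding a Naive Bayes classifier.
    model = make_pipeline(TfidfVectorizer(), MultinomialNB())

    # Step 4: k-fold cross-validation (k kept tiny because the toy set is tiny).
    print(cross_val_score(model, texts, labels, cv=2, scoring="precision"))

    # Step 5: fit on everything and rate new pages.
    model.fit(texts, labels)
    print(model.predict(["order discount aspirin shipped from canada today"]))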

When you talk about not caring whether you mark a good site as a bad site, that is about recall. Recall measures, of the sites you should have gotten back, how many you actually got back. Precision measures, of the sites you marked as 'good' or 'bad', how many were correct. Since you state that your goal is precision and that recall isn't as important, you can tweak your model to have higher precision.
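
To make those two numbers concrete, a toy calculation with made-up counts for the 'good' class:

    # Made-up counts: the model kept 40 pages as 'good'; 36 of those really
    # are good, and 10 genuinely good pages were thrown away as 'bad'.
    true_positives, false_positives, false_negatives = 36, 4, 10

    precision = true_positives / (true_positives + false_positives)  # 36/40 = 0.90
    recall = true_positives / (true_positives + false_negatives)     # 36/46 ~= 0.78

    print(precision, recall)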

Thien