Hello, I'm doing a university project that must gather and combine data on a user-provided topic. The problem I've encountered is that Google search results for many terms are polluted with low-quality autogenerated pages, and if I use them, I can end up with wrong facts. How can I estimate the quality/trustworthiness of a page?
You may think "nah, Google engineers have been working on this problem for 10 years and he's asking for a solution", but if you think about it, a search engine must provide up-to-date content, and if it marks a good page as bad, its users will be dissatisfied. I don't have that constraint, so if my algorithm accidentally marks some good pages as bad, that wouldn't be a problem.
Here's an example. Say the input is "buy aspirin in south la"; try Googling it. The first three results have already been deleted from their sites, but the fourth one is interesting: radioteleginen.ning.com/profile/BuyASAAspirin (I don't want to make an active link).
Here's the first paragraph of the text:
The bare of purchasing prescription drugs from Canada is big in the U.S. at this moment. This is because in the U.S. prescription drug prices bang skyrocketed making it arduous for those who bang limited or concentrated incomes to buy their much needed medications. Americans pay more for their drugs than anyone in the class.
The rest of the text is similar, and then a list of related keywords follows. This is what I consider a low-quality page. While this particular text almost makes sense (apart from the mangled wording), the other examples I've seen (but can't find right now) are just rubbish, whose only purpose is to pull in some visitors from Google before getting banned a day after creation.
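To make "low quality" a bit more concrete, here is a minimal sketch of the kind of heuristic I'm imagining. The two signals (stopword ratio and how much a single keyword dominates the page), the thresholds, and the function name are all my own guesses for illustration, not anything a real search engine is documented to use:

```python
import re
from collections import Counter

# A tiny set of common English stopwords; a real version would use a fuller
# list (e.g. from NLTK). This is just enough for the sketch.
STOPWORDS = {
    "the", "a", "an", "and", "or", "of", "to", "in", "for", "is", "are",
    "it", "this", "that", "on", "with", "as", "at", "by", "from", "be",
}


def crude_quality_score(text: str) -> float:
    """Return a rough 0..1 score; lower means 'looks more like spam'.

    Both signals are assumptions, not a known algorithm:
      * stopword ratio -- spun or keyword-stuffed text tends to contain
        fewer ordinary function words than natural prose;
      * repetition -- if one content word dominates the page, it is
        probably a keyword-stuffed doorway page.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    if len(tokens) < 50:  # too little text to judge either way
        return 0.0

    stopword_ratio = sum(t in STOPWORDS for t in tokens) / len(tokens)

    content_words = [t for t in tokens if t not in STOPWORDS]
    most_common_share = (
        Counter(content_words).most_common(1)[0][1] / len(content_words)
        if content_words else 1.0
    )

    # Normal English prose has a stopword ratio of very roughly 0.3-0.5,
    # and no single content word should account for more than a few
    # percent of the text. The 0.35 and 0.10 cutoffs are arbitrary guesses.
    return min(stopword_ratio / 0.35, 1.0) * (
        1.0 - min(most_common_share / 0.10, 1.0)
    )


if __name__ == "__main__":
    spammy = "buy aspirin online buy aspirin cheap aspirin south la aspirin " * 20
    print(crude_quality_score(spammy))  # near 0 -> treat as low quality
```

Since false positives are cheap for me, I could simply drop every page below some score threshold, but I suspect there are better-known signals than these, which is what I'm really asking about.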