I constantly need to Google error messages to find out what they mean and what solutions people have found. Usually those error messages were posted in some newsgroup. Unfortunately, Google's results (even when it says "similar results removed") usually consist of many copies of the exact same message (usually with some non-answer) replicated across many different newsgroup archives. Even Google Groups results alone show up three or four times, counting the different localizations of Google Groups. How do you sort through huge piles of the exact same newsgroup posts?

+1  A: 

Try using a Google Custom Search like Dan Appleman's SearchDotNet first. That will eliminate a lot of duplicates.

Doug L.
A: 

If it is really a bee in your bonnet, I'm afraid you are going to have to write some custom code to deal with this. Why? Because it isn't a big enough problem for the general community to solve it in a general way.

It can be done in three or four hours of work. You'll just have to dig in, and you'll be glad you did.

I'm a fan of Linux platforms and Perl + Mechanize. In this case you would use Mechanize to run the Google search, then find the paragraphs that the results have in common and kill the dups. Of course, Mechanize-like libraries are available for Ruby and Python, and surely similar tools exist for Java and the Windows flavors.
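To give a rough idea of the "find common paragraphs and kill the dups" step, here is a minimal Python sketch. It is only an illustration, not a finished tool: it assumes you have already fetched each result page and reduced it to plain text, and the names `paragraph_fingerprints` and `drop_duplicates`, the 40-character minimum, and the 0.5 overlap threshold are all made up for this example. It does no searching or downloading itself.

```python
import hashlib
import re


def paragraph_fingerprints(text, min_chars=40):
    """Return a set of hashes, one per substantial paragraph in text."""
    fingerprints = set()
    for paragraph in re.split(r"\n\s*\n", text):
        # Normalize case and whitespace so trivial reformatting still matches.
        normalized = " ".join(paragraph.lower().split())
        # Skip short paragraphs (archive headers, signatures, nav links).
        if len(normalized) >= min_chars:
            digest = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
            fingerprints.add(digest)
    return fingerprints


def drop_duplicates(pages, threshold=0.5):
    """Keep only the first page of each group of near-identical pages.

    pages is a list of (url, text) pairs (an assumed input format).
    A page counts as a duplicate when at least `threshold` of its
    paragraphs already appeared in a page that was kept earlier.
    """
    kept = []
    kept_prints = []
    for url, text in pages:
        prints = paragraph_fingerprints(text)
        is_dup = any(
            prints and len(prints & seen) / len(prints) >= threshold
            for seen in kept_prints
        )
        if not is_dup:
            kept.append((url, text))
            kept_prints.append(prints)
    return kept
```

You would call `drop_duplicates` on the list of fetched results and read only what comes back. Hashing whole paragraphs is crude, so archives that rewrap or truncate the post will slip through; lowering the threshold or comparing smaller chunks would be the next thing to try.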

BTW, this problem is made worse by the fact that lots of web sites exist just to replicate known answers in hopes of ad revenue. Despicable, really; they should be hunted down, but again it isn't a big enough problem that most people are willing to take action.

Jeff