I'm writing a .NET program that downloads web pages using proxy servers from a list. I'm running into the problem that sometimes a proxy server returns its own junk content instead of the content from the intended page. The download appears to succeed, but when you look at the downloaded HTML, it has obviously bogus content, such as:

Welcome to XYZ proxy, have a great day.
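For context, the downloads are done roughly like this (a simplified sketch; the actual proxy host and port come from my list):

    using System;
    using System.Net;

    class ProxyDownloader
    {
        // Simplified version of the download routine; proxyHost/proxyPort
        // are placeholders for entries taken from my proxy list.
        public static string DownloadPage(string url, string proxyHost, int proxyPort)
        {
            using (var client = new WebClient())
            {
                client.Proxy = new WebProxy(proxyHost, proxyPort);
                return client.DownloadString(url);
            }
        }
    }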

I've implemented a routine that examines the HTML for known bogus strings (a stripped-down version is at the end of this post), but this is really brittle: new strings appear all the time, and the bad content gets past the stale filters. So now I'm considering some sort of Bayesian filter, since this is similar to spam filtering. But before I go to all that trouble, I'm hoping someone here knows a more straightforward way to detect this situation, perhaps by examining the response headers. Thanks in advance for any help you can offer.
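For reference, the current check is essentially a hard-coded substring blacklist, along these lines (simplified; the real list is longer and goes stale quickly):

    using System;
    using System.Linq;

    class BogusContentFilter
    {
        // Fragments I've seen proxies inject in place of the real page.
        // The problem: this list is never up to date.
        private static readonly string[] KnownBogusStrings =
        {
            "Welcome to XYZ proxy"
            // ...plus whatever else has turned up so far
        };

        // Returns true if the downloaded HTML contains any known junk fragment.
        public static bool LooksBogus(string html)
        {
            return KnownBogusStrings.Any(s =>
                html.IndexOf(s, StringComparison.OrdinalIgnoreCase) >= 0);
        }
    }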