My project requires me to validate a large number of web URLs. These URLs have been captured by a very unreliable process that I do not control. All of the URLs have already been regex-validated and are known to be well-formed, and I also know that they all have valid TLDs.
I want to filter these URLs quickly to determine which of them are bad. At this point I do not care what content is on the pages - I'd just like to know, as quickly as possible, which of the pages are inaccessible (e.g. produce a 404 error).
Given that there are a lot of these, I do not want to download the entire page for each one - just the HTTP headers - and then take a good guess from those headers as to whether the page is likely to exist.
Can it be done?
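
For what it's worth, this is roughly the kind of thing I'm imagining (a rough sketch in Python using the `requests` library; the 5-second timeout and the "status below 400 means the page exists" rule are just my assumptions, and I realise some servers may reject HEAD requests, so a fallback to GET might be needed):

```python
import requests

def url_probably_exists(url, timeout=5):
    """Send a HEAD request and judge the URL by its status code alone."""
    try:
        # HEAD asks the server for headers only, no response body.
        response = requests.head(url, allow_redirects=True, timeout=timeout)
    except requests.RequestException:
        # DNS failures, timeouts, refused connections, etc.
        return False
    # Treat anything other than a 4xx/5xx response as "probably exists".
    return response.status_code < 400

urls = ["http://example.com/", "http://example.com/no-such-page"]
dead = [u for u in urls if not url_probably_exists(u)]
print(dead)
```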