All,

I have a small web crawler that sometimes has to crawl Twitter and pull out URLs. I use a modified version of the WebClient class provided in the .NET Framework.

Normally this works fine, even with shortened URLs from sites such as bit.ly.

However, with the following URL, the WebClient times out: http://is.gd/CioW

It's meant to redirect you to here: http://digg.com/microsoft/Less_Virtual_More_Machine_Windows_7_and_the_magic_of_Boot

Do you think they're filtering certain clients?

Any ideas as to how I can fix this or why it's happening?

A: 

Are you sure you can hit that URL from your network, without going through a proxy?

Does your WebClient follow redirects? You could test this by creating a TinyURL and seeing whether your WebClient can browse to it.

If you are going through a proxy in your browser, you'll need to set it up in the WebClient control.
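As a rough illustration of the proxy setup, here is a minimal sketch that assumes a plain HTTP proxy authenticated with the current Windows credentials. The `ProxyExample` class, the `CreateClient` helper, and the proxy address in the usage line are placeholders, not anything from the question.

```csharp
using System;
using System.Net;

static class ProxyExample
{
    // Hypothetical helper: returns a WebClient that sends its requests
    // through the proxy at proxyAddress.
    public static WebClient CreateClient(string proxyAddress)
    {
        var client = new WebClient();
        client.Proxy = new WebProxy(proxyAddress)
        {
            // Assumption: the proxy accepts the logged-in user's credentials.
            Credentials = CredentialCache.DefaultCredentials
        };
        return client;
    }
}
```

Usage would look something like `ProxyExample.CreateClient("http://proxy.example.local:8080")`, with your real proxy address substituted.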

It should be easy to test whether they are filtering clients: set the UserAgent on the request object to match that of Firefox, for example.
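Since you already have a modified WebClient, one way to try this is to override `GetWebRequest` so that every request carries a browser-like user agent. This is only a sketch: the `BrowserLikeWebClient` name and the `UserAgent` property are made up for illustration, and the default string is just an example IE7-style value.

```csharp
using System;
using System.Net;

// Hypothetical subclass: applies a browser-like User-Agent to each request.
class BrowserLikeWebClient : WebClient
{
    // Example value only; swap in whichever browser string you want to test.
    public string UserAgent { get; set; } =
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)";

    protected override WebRequest GetWebRequest(Uri address)
    {
        WebRequest request = base.GetWebRequest(address);

        // Only HTTP(S) requests carry a User-Agent header.
        var httpRequest = request as HttpWebRequest;
        if (httpRequest != null)
        {
            httpRequest.UserAgent = UserAgent;
        }
        return request;
    }
}
```

If `new BrowserLikeWebClient().DownloadString("http://is.gd/CioW")` returns the page while the unmodified client times out, that would point to user-agent filtering on their side.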

Winston Smith
Yup, there is a proxy, but I've already configured the client to go through it. It works for every other link I throw at it.
Matthew Rathbone
Is the proxy blocking that specific URL, via content filtering software or something?
Winston Smith
It looks like it's the site blocking unrecognized user agents. When I set it to an IE7 string it worked fine. What do you think is a safe user-agent string to use?
Matthew Rathbone
I'd be inclined to go with a widely recognised web crawler, e.g. Googlebot, if they accept that. But you run the risk of it being blocked at some point in the future. Safest is probably IE or Mozilla.
Winston Smith
Ya, think I'll just go with IE7. Should be accepted everywhere really.
Matthew Rathbone