Use HttpWebRequest to download web pages without key sensitive issues

+1  A: 

[update: I don't know why, but both examples below now work fine! Originally I was also seeing a 403 on the page2 example. Maybe it was a server issue?]

First, WebClient is easier. Actually, I've seen this before. It turned out to be case sensitivity in the url when accessing wikipedia; try ensuring that you have used the same case in your request to wikipedia.

[updated] As Bruno Conde and gimel observe, using %27 should make it consistent (the intermittent behaviour suggests that some Wikipedia servers may be configured differently from others)

I've just checked, and in this case the case issue doesn't seem to be the problem... however, if it worked (it doesn't), this would be the easiest way to request the page:

        using (WebClient wc = new WebClient())
        {
            string page1 = wc.DownloadString("http://en.wikipedia.org/wiki/Algeria");

            string page2 = wc.DownloadString("http://en.wikipedia.org/wiki/%27Abadilah");
        }

I'm afraid I can't think what to do about the leading apostrophe that is breaking things...
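Since the question title asks about HttpWebRequest specifically, the equivalent request via that class might look like the sketch below (same URL as above, with the apostrophe pre-escaped as %27):

```csharp
using System;
using System.IO;
using System.Net;

class Program
{
    static void Main()
    {
        // Same article as the page2 example, apostrophe escaped as %27.
        var request = (HttpWebRequest)WebRequest.Create(
            "http://en.wikipedia.org/wiki/%27Abadilah");

        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            // Read the whole page into a string, as WebClient.DownloadString does.
            string page = reader.ReadToEnd();
            Console.WriteLine(page.Length);
        }
    }
}
```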

Marc Gravell
I just tried the above code and it worked fine for both page1 and page2, what error were you receiving?
duckworth
@duckworth - OK, that is odd. When I posted, I was getting 403 on page2, but now it works! Maybe it was a server issue in the first place!
Marc Gravell
There is a pattern: when I make the request in C# it fails, but if I open the page first in the browser and then make the C# request, it sometimes works. I don't know where the problem is. It's weird...
Ifx64
@Haytham El-Fadeel: maybe it works if it can get it from the cache, but doesn't work for vanilla requests?
Marc Gravell
+1  A: 

I also got strange results ... First, the

http://en.wikipedia.org/wiki/'Abadilah

didn't work and after some failed tries it started working.

The second url,

http://en.wikipedia.org/wiki/'t_Zand_(Alphen-Chaam)

always failed for me...

The apostrophe seems to be responsible for these problems. If you replace it with

%27

all urls work fine.
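The replacement itself is just a plain string substitution on the offending character, e.g.:

```csharp
// Escape the apostrophe by hand before making the request.
string url = "http://en.wikipedia.org/wiki/'t_Zand_(Alphen-Chaam)";
string escaped = url.Replace("'", "%27");
// escaped is now "http://en.wikipedia.org/wiki/%27t_Zand_(Alphen-Chaam)"
```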

bruno conde
I tried to encode it as %27 using HttpUtility.UrlPathEncode, but it didn't work
Ifx64
+1  A: 

Try escaping the special characters using Percent-Encoding (RFC 3986, section 2.1). For example, a single quote is represented by %27 in the URL (IRI).
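The percent-encoded form is just % followed by the character's hex code; a quick sketch of producing it in C# (Uri.HexEscape handles a single character):

```csharp
using System;

class Demo
{
    static void Main()
    {
        // Built-in helper: percent-encodes a single character.
        string escaped = Uri.HexEscape('\'');          // "%27"

        // Or build it by hand from the ASCII code of the apostrophe (0x27).
        string manual = "%" + ((int)'\'').ToString("X2"); // also "%27"

        Console.WriteLine(escaped + " " + manual);
    }
}
```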

gimel
+1  A: 

I'm sure the OP has this sorted by now but I've just run across the same kind of problem - intermittent 403's when downloading from wikipedia via a web client. Setting a user agent header sorts it out:

client.Headers.Add("user-agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)");
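If you're using HttpWebRequest (as in the question title) rather than WebClient, the same fix would go through the UserAgent property; a sketch, reusing the UA string above:

```csharp
using System.Net;

class Demo
{
    static void Main()
    {
        var request = (HttpWebRequest)WebRequest.Create(
            "http://en.wikipedia.org/wiki/%27Abadilah");
        // Present a browser-like user agent so the server doesn't reject us.
        request.UserAgent =
            "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)";
    }
}
```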
Martynnw
Setting the user agent fixed this for me - an anti-spam "feature" of Wikipedia?
Ben Aston