Why do I get a 404 error when I run this line?

urllib2.urlopen("http://localhost/new-post#comment-29")

Meanwhile, http://localhost/new-post#comment-29 loads fine in any browser.

Does the urlopen method not handle URLs with "#" in them?

Does anybody know?

+6  A: 

In HTTP, the fragment (everything from # onwards) is not sent to the server across the network: the browser retains it locally and, once the server's response is fully received, uses it to locate the exact spot in the page to be shown as "current" (for example, if the returned page is HTML, this is done by parsing the HTML and looking for the first suitable <a> tag).

So, the procedure is: remove the fragment, e.g. via urlparse.urlparse (or urlparse.urldefrag); use the rest to fetch the resource; parse it appropriately based on the content-type header of the server's response; then take whatever visual action your program performs regarding the "current spot" on the resource, by locating within the parsed resource the fragment you retained in the first step.
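The first step of that procedure can be sketched as follows. This is a minimal illustration rather than part of the original answer: split_fragment is a hypothetical helper name, and the import fallback is only there so the snippet runs on both Python 2 (urlparse) and Python 3 (urllib.parse).

```python
try:
    from urllib.parse import urldefrag  # Python 3
except ImportError:
    from urlparse import urldefrag      # Python 2, alongside urllib2


def split_fragment(url):
    """Return (url_without_fragment, fragment) for the given URL.

    Hypothetical helper: the fragment is an empty string when the
    URL has no '#' part.
    """
    base, fragment = urldefrag(url)
    return base, fragment


base, fragment = split_fragment("http://localhost/new-post#comment-29")
# 'base' is what should be handed to urlopen; 'fragment' is kept
# locally and used only after the response has been parsed.
```

Passing base instead of the original URL to urlopen means only the path and query reach the server.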

Alex Martelli
@Alex: Thanks a lot! There's still another problem: if I use the "post" method to communicate directly with the server through urlopen, I get a response URL such as "http://localhost/new-post#comment-29" (e.g. when submitting comments), which then results in a 404. To get around this, is there an alternative method I can use, or are there parameters that control the whole process?
Shane
@Shane, you need to remove the fragment before the URL's path and query are sent to the server -- you could build a fancy opener for the purpose, but I simply suggest doing it directly. Whether you use it for GET or POST makes no difference.
Alex Martelli
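One possible shape for the "fancy opener" mentioned above, as a rough sketch only. It assumes the 404 comes from a redirect whose target URL carries a fragment, and that the urllib2 in use forwards that fragment unchanged. FragmentStrippingRedirectHandler and open_without_fragment are hypothetical names; the import fallback is only there so the snippet also runs on Python 3.

```python
try:
    from urllib.request import HTTPRedirectHandler, Request, build_opener
    from urllib.parse import urldefrag
except ImportError:  # Python 2
    from urllib2 import HTTPRedirectHandler, Request, build_opener
    from urlparse import urldefrag


class FragmentStrippingRedirectHandler(HTTPRedirectHandler):
    """Hypothetical handler: drop the '#...' part of a redirect target."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Keep only the part of the target URL before the fragment.
        clean_url, _fragment = urldefrag(newurl)
        return HTTPRedirectHandler.redirect_request(
            self, req, fp, code, msg, headers, clean_url)


# build_opener replaces the default redirect handler with the subclass.
opener = build_opener(FragmentStrippingRedirectHandler())


def open_without_fragment(url, data=None, headers=None):
    """Hypothetical wrapper: open a request through the custom opener."""
    return opener.open(Request(url, data, headers or {}))
```

A call such as open_without_fragment("http://localhost/wp-comments-post.php", Data, Header) would then follow the redirect to the fragment-free URL. Stripping the fragment directly before each request, as suggested above, remains the simpler option when the URL is under your control.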
@Alex: Well, I don't think I have control over that. Take comment submission as an example; normally I would code it like this: Req = urllib2.Request("http://localhost/wp-comments-post.php", Data, Header); Response = urllib2.urlopen(Req). I always get a 404, even though the comments still show up. I know it must be the "#" problem, but I don't know how to intercept the process and remove the "#" so that it no longer raises a 404. So how do I do this?
Shane
@Alex: I just found that mechanize makes things much easier; there's no need to remove the "#" from the URL. Check it out: http://wwwsearch.sourceforge.net/mechanize/
Shane