Hi guys. I'm trying to collect data from a frequently updated blog, so I use a while loop that calls urllib2.urlopen("http://example.com") every 5 minutes to refresh the page and collect the data I want.

But I noticed that I'm not getting the most recent content this way; it differs from what I see in a browser such as Firefox. After comparing the page source in Firefox with the same page fetched from Python, I found that WP Super Cache is what's preventing me from getting the most recent result.

I still get the same cached page even if I spoof the headers in my Python code. Is there a way to bypass WP Super Cache? And why doesn't Firefox get the super-cached page at all?

+1  A: 

Have you tried changing the URL with some harmless data? Something like this:

import time
import urllib2
urllib2.urlopen("http://example.com?time=%s" % int(time.time()))

It will actually call http://example.com?time=1283872559. Most caching systems will bypass the cache if there's a query string or anything else they don't expect.
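Here's a rough sketch of the same idea as a small helper, in case it's useful. It assumes Python 3's standard library (urllib.parse); the function name cache_busting_url and its now argument are made up for illustration. It also fills in "/" when the URL has no path, on the assumption that some servers reject a request whose path is empty.

```python
import time
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def cache_busting_url(url, now=None):
    """Return url with a throwaway 'time' query parameter appended.

    'now' is a hypothetical hook so the timestamp can be fixed in tests;
    by default the current Unix time is used.
    """
    stamp = int(time.time() if now is None else now)
    parts = urlsplit(url)
    # Assumption: an empty path is replaced by "/" so the query string
    # does not directly follow the host name.
    path = parts.path or "/"
    # Keep any query parameters the URL already has, then add ours.
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append(("time", str(stamp)))
    return urlunsplit(
        (parts.scheme, parts.netloc, path, urlencode(query), parts.fragment)
    )
```

The result can be passed to urllib.request.urlopen on Python 3 (or the same string to urllib2.urlopen on Python 2) inside the polling loop, so each request carries a different query string.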

Oli
Indeed; if you check [the WP Super Cache homepage](http://ocaoimh.ie/wp-super-cache/), one of the features there is "Don’t super cache any request with GET parameters." So you may not even need to vary your parameter. (WP Super Cache requires a custom permalink structure; it won't work with the default permalink setup that uses GET parameters.)
Matt Gibson
If I browse to "http://example.com?time=1283872559" in Firefox, everything is fine and it shows exactly what "http://example.com" shows. But I get a 400 Bad Request error using Python. Why is this happening?
Shane
@Matt Gibson: What puzzles me is that I always get a freshly generated page every time I refresh the site in Firefox by pressing F5, which would mean WP Super Cache is NOT working? But using urllib2.urlopen I always get the cached page...
Shane
I noticed that although this line still shows up: <!-- Cached page generated by WP-Super-Cache on 2010-09-07 17:07:09 -->, each request returns a fresh timestamp, which means the page was just generated. This method works! Thanks a lot to Oli, and also to Matt Gibson for the detailed explanation.
Shane
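For anyone who wants to run the same check from a script, here is a minimal sketch that pulls the generation time out of the marker comment quoted above. It assumes the comment has exactly the format shown in this thread; the names CACHE_MARKER and cache_generated_at are hypothetical.

```python
import re
from datetime import datetime

# Matches the marker comment in the format quoted in this thread.
CACHE_MARKER = re.compile(
    r"<!--\s*Cached page generated by WP-Super-Cache on "
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s*-->"
)


def cache_generated_at(html):
    """Return the marker's timestamp as a datetime, or None if absent."""
    match = CACHE_MARKER.search(html)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
```

Comparing the value from two consecutive fetches shows whether the page was regenerated between requests; an unchanged timestamp suggests the same cached copy was served again.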