This isn't mentioned in the Python documentation. I've recently been testing a website by repeatedly fetching it with urllib2.urlopen() to extract certain content, and I've noticed that urllib2.urlopen() sometimes doesn't pick up newly added content after I update the site. So I wonder: does it cache stuff somewhere?
It doesn't.
If you don't see new data, this could have many reasons. Most bigger web services use server-side caching for load balancing, for example Squid (a transparent proxy) or memcached (accessed from inside the server process).
If the problem is caused by server-side caching, there is usually no way to force the server to give you the latest data.
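One cheap thing to try before blaming urllib2 is to send explicit no-cache request headers, which ask any intermediate cache to revalidate (it won't help with server-side caches like memcached, but it costs nothing). A minimal sketch, written for Python 3 where urllib2's functionality moved to urllib.request; the URL is a placeholder:

```python
# Sketch: request a fresh copy by sending explicit cache-busting
# request headers. The URL below is a placeholder.
from urllib.request import Request, urlopen

req = Request(
    'http://example.com/',            # placeholder URL
    headers={
        'Cache-Control': 'no-cache',  # HTTP/1.1 caches must revalidate
        'Pragma': 'no-cache',         # legacy HTTP/1.0 caches
    },
)

# urlopen(req) would perform the fetch with those headers set; it is
# commented out here to avoid a network round trip.
# response = urlopen(req)

# urllib.request stores header names with only the first letter
# capitalized, so the stored key is 'Cache-control':
print(req.get_header('Cache-control'))
```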
For caching proxies like Squid, things are different. Usually, Squid adds some additional headers to the HTTP response (visible via response.info().headers).
If you see a header field called X-Cache or X-Cache-Lookup, this means that you aren't connected to the remote server directly, but through a transparent proxy.
If you see something like X-Cache: HIT from proxy.domain.tld, it means the response you got was served from the cache. The opposite is X-Cache: MISS from proxy.domain.tld, which means that the response is fresh.
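The check above can be sketched as a small helper. The headers dicts below are hypothetical stand-ins for what a real response.info() would contain, and the proxy host name is a placeholder:

```python
# Sketch: detect transparent-proxy caching from HTTP response headers.
# The dicts below simulate response headers; no network access needed.

def cache_status(headers):
    """Return 'HIT', 'MISS', or None if no X-Cache header is present."""
    x_cache = headers.get('X-Cache')
    if x_cache is None:
        return None  # no caching proxy detected
    # The header value looks like "HIT from proxy.domain.tld",
    # so the first word is the cache status.
    return x_cache.split()[0]

# Simulated response headers (proxy.domain.tld is a placeholder):
cached = {'X-Cache': 'HIT from proxy.domain.tld'}
fresh = {'X-Cache': 'MISS from proxy.domain.tld'}
direct = {'Content-Type': 'text/html'}

print(cache_status(cached))   # HIT
print(cache_status(fresh))    # MISS
print(cache_status(direct))   # None
```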
I find it hard to believe that urllib2 does not do caching, because in my case the data is only refreshed after restarting the program. If the program is not restarted, the data appears to be cached forever. Retrieving the same data with Firefox never returns stale data, either.