I am attempting to determine the size of a downloaded file in Python before parsing and manipulating it with BeautifulSoup. (I intend to move to ElementTree soon, but having played with it briefly, it does not solve the problem I am posing here, as far as I can see.)

import urllib2, BeautifulSoup
query = 'http://myexample.file.com/file.xml'
f = urllib2.urlopen(query)
print len(f.read())   # first read() consumes the whole stream
soup = BeautifulSoup.BeautifulStoneSoup(f.read())   # second read() gets nothing

This code fails because the first read() (inside len()) consumes the whole stream and leaves the file object at EOF, so the second read() returns an empty string by the time I hand it to BeautifulSoup.
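This is easy to reproduce without touching the network; below, io.BytesIO merely stands in for the file-like object urlopen returns (an assumption for illustration, but the read()-to-EOF behaviour is the same):

```python
import io

# io.BytesIO stands in for the file-like response object from urlopen
f = io.BytesIO(b"<root><item/></root>")
first = f.read()    # consumes the entire stream; position is now at EOF
second = f.read()   # nothing left to read
print(len(first))   # 20
print(len(second))  # 0
```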

My initial thought was simply to copy the object with an fcopy = f line, but this taught me that I was merely binding a second name to the same underlying object, gaining nothing.

I then thought that fcopy = copy.copy(f) would create a true copy of the object, but apparently not: reading f still leaves fcopy as an empty file object.
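The aliasing half of this can be seen with any file-like object (io.BytesIO again stands in for the response object here); a shallow copy runs into the same wall because it shares the underlying socket and read position rather than duplicating the data:

```python
import io

f = io.BytesIO(b"some data")
fcopy = f            # binds a second name to the SAME object -- not a copy
assert fcopy is f    # both names point at one object
f.read()             # consuming the stream through one name...
print(fcopy.read())  # ...leaves nothing to read through the other
```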

I even read about passing objects as parameters to functions in order to get around this, and tried the following code:

import urllib2, BeautifulSoup
def get_bytes(file):
    return len(file.read())

query = 'http://myexample.file.com/file.xml'
f = urllib2.urlopen(query)
print(get_bytes(f))
soup = BeautifulSoup.BeautifulStoneSoup(f.read())

But I had the same problem. How can I determine the file size of this object without effectively destroying the file?

+2  A: 

Copy the content of the file into a variable and work with it:

import urllib2, BeautifulSoup

query = 'http://myexample.file.com/file.xml'
f = urllib2.urlopen(query)
content = f.read()    # read the response exactly once
print len(content)    # the string keeps the data, so len() consumes nothing
soup = BeautifulSoup.BeautifulStoneSoup(content)  # and it can be reused here
eumiro
doh! So flippin obvious! Thanks :)
fearoffours