I am attempting to determine the size of a downloaded file in Python before parsing and manipulating it with BeautifulSoup. (I intend to switch to ElementTree soon, but from a brief trial it does not solve the problem I am posing here, as far as I can see.)
    import urllib2, BeautifulSoup

    query = 'http://myexample.file.com/file.xml'
    f = urllib2.urlopen(query)
    print len(f.read())
    soup = BeautifulSoup.BeautifulStoneSoup(f.read())
This code falters because the first read() (inside the len() call) consumes the response to EOF, so the file object is empty by the time I hand it to BeautifulSoup.
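The one-shot nature of the stream can be seen in isolation. Here io.BytesIO is a hypothetical stand-in for the HTTP response (the real object comes from urllib2.urlopen), but it exhausts in the same way:

```python
from io import BytesIO

# Stand-in for the downloaded file; any file-like stream behaves the same.
f = BytesIO(b"<root><item/></root>")

first = f.read()    # consumes the entire stream
second = f.read()   # position is now at EOF, nothing left to return

print(len(first))   # 20
print(len(second))  # 0
```

The second read() returns an empty result, which is exactly why the parser then sees an empty document.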
My initial thought was simply to copy the object with an fcopy = f line, but this taught me that I was merely binding a second reference to the same underlying object and gaining nothing.
I then thought that fcopy = copy.copy(f) would create a true copy of the object, but apparently not: after reading f, fcopy turns out to be an exhausted file object as well.
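The aliasing behaviour behind fcopy = f can be demonstrated directly. Again io.BytesIO is only a stand-in for the real response object:

```python
from io import BytesIO

f = BytesIO(b"payload")  # stand-in for the urllib2 response
fcopy = f                # binds a second NAME to the same object

assert fcopy is f        # no new stream was created
f.read()                 # exhaust the stream through one name...
leftover = fcopy.read()  # ...and the "copy" is exhausted too

print(leftover)          # b''
```

Since both names point at one object with one read position, draining it through either name drains it for both.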
I even read about passing objects as parameters to functions in order to get around this, and tried the following code:
    import urllib2, BeautifulSoup

    def get_bytes(file):
        return len(file.read())

    query = 'http://myexample.file.com/file.xml'
    f = urllib2.urlopen(query)
    print(get_bytes(f))
    soup = BeautifulSoup.BeautifulStoneSoup(f.read())
But I had the same problem. How can I determine the size of this file without effectively destroying the object?
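For what it's worth, one common workaround (not claimed to be the only answer) is to read the stream exactly once into a string and use that string for both the size check and the parse. Sketched with io.BytesIO as a stand-in for urllib2.urlopen(query):

```python
from io import BytesIO

f = BytesIO(b"<root><item/></root>")  # stand-in for the opened URL

data = f.read()   # read the stream exactly once
print(len(data))  # size in bytes: 20

# Parse the already-read string instead of the exhausted file object:
# soup = BeautifulSoup.BeautifulStoneSoup(data)
```

Strings, unlike streams, can be measured and reused as many times as needed.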