views: 4866
answers: 3

I'm using urllib2 to read in a page. I need to run a quick regex on the source and pull out a few variables. I'm new to Python, so I'm struggling to see how to use the file object that urllib2 returns to do this.

+17  A: 

You can use Python in interactive mode to search for solutions.

If f is your object, you can enter dir(f) to see all of its methods and attributes. One of them is called read. Enter help(f.read) and it tells you that f.read() is the way to retrieve a string from a file object.
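
As a rough sketch of that exploration, the snippet below uses io.StringIO as a stand-in for the response object. That stand-in and its content are assumptions, chosen only so the example runs without a network connection; the object returned by urllib2.urlopen can be inspected the same way.

```python
import io

# Stand-in for the file-like object returned by urlopen (hypothetical content)
f = io.StringIO(u"<title>Example page</title>")

# dir() lists attribute and method names; keep only the public ones
public_names = [name for name in dir(f) if not name.startswith("_")]
has_read = "read" in public_names

# help(f.read) prints the documentation interactively;
# the same text is available as the method's docstring
read_doc = f.read.__doc__

# read() returns the remaining content as a string
source = f.read()
```

In an interactive session you would simply type dir(f) and help(f.read) and read the output directly.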

stesch
Thanks for the in-depth answer (especially about finding object attributes/methods). .read() worked perfectly.
Oli
Excellent answer from the 'teaching to fish' school. I would give you +2 if I could!
Will Dean
+3  A: 

From the documentation for file.read() (my emphasis):

file.read([size])

Read at most size bytes from the file (less if the read hits EOF before obtaining size bytes). If the size argument is negative or omitted, read all data until EOF is reached. The bytes are returned as a string object. An empty string is returned when EOF is encountered immediately. (For certain files, like ttys, it makes sense to continue reading after an EOF is hit.) Note that this method may call the underlying C function fread more than once in an effort to acquire as close to size bytes as possible. Also note that when in non-blocking mode, less data than was requested may be returned, even if no size parameter was given.
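
To illustrate the size argument and the empty result at EOF, here is a small sketch. It assumes io.BytesIO as a stand-in for a real file or response object, with a made-up 10-byte body.

```python
import io

# Stand-in for a file or response object (hypothetical 10-byte body)
f = io.BytesIO(b"0123456789")

chunks = []
while True:
    chunk = f.read(4)   # returns at most 4 bytes per call
    if not chunk:       # an empty result means EOF was reached
        break
    chunks.append(chunk)

# Reassemble the full body from the pieces
body = b"".join(chunks)
```

Calling read() with no argument would return the same body in a single call.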

Be aware that a regexp search on a large string object may not be efficient. Consider doing the search line by line instead, using file.next() or a plain for loop (a file object is its own iterator).
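
A line-by-line search might look like the sketch below. The page content, the pattern, and the use of io.StringIO in place of the urllib2 response are all assumptions made for illustration.

```python
import io
import re

# Stand-in for the response object (hypothetical page content)
f = io.StringIO(u"<html>\n<title>Voidspace</title>\n<p>price = 42</p>\n</html>\n")

# Hypothetical pattern: capture the digits after "price = "
pattern = re.compile(r"price = (\d+)")

price = None
for line in f:                    # iterating a file object yields one line at a time
    match = pattern.search(line)
    if match:
        price = int(match.group(1))
        break                     # stop reading once the value is found
```

Note that this only works when the text you are looking for sits on a single line; a pattern that spans a line break will not match this way.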

gimel
+4  A: 

Michael Foord, aka Voidspace, has an excellent tutorial on urllib2, which you can find here: urllib2 - The Missing Manual

What you are doing should be pretty straightforward. Observe this sample code:

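import re
import urllib2  # Python 2 only; in Python 3 this is urllib.request

# Fetch the page; urlopen() returns a file-like response object
response = urllib2.urlopen("http://www.voidspace.org.uk/python/articles/urllib2.shtml")
html = response.read()  # read() returns the whole body as a string

# Compile a case-insensitive pattern with one capturing group
pattern = r'(V.+space)'
wordPattern = re.compile(pattern, re.IGNORECASE)
results = wordPattern.search(html)  # search() returns None if nothing matches
if results is not None:
    print results.groups()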
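
Since the question asks about pulling out a few variables, named groups are one way to collect several values in a single search. In this sketch the html string, the variable names, and the pattern are hypothetical; in practice html would come from response.read().

```python
import re

# Hypothetical page source standing in for response.read()
html = u'<script>var userId = 1234; var token = "abc123";</script>'

# Each (?P<name>...) group captures one value under that name
pattern = re.compile(r'var userId = (?P<user_id>\d+); var token = "(?P<token>\w+)"')

match = pattern.search(html)
# groupdict() maps group names to the captured strings
values = match.groupdict() if match else {}
```

The captured values are always strings, so convert them (for example with int()) if you need numbers.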
David in Dakota