If I have a directory on a remote web server that allows directory browsing, how would I go about fetching all the files listed there from my other web server? I know I can use urllib2.urlopen to fetch individual files, but how would I get a list of all the files in that remote directory?
A:
If the web server has directory browsing enabled, it will return an HTML document with links to all the files. You can parse that HTML document and extract all the links, which gives you the list of files.
You can use the HTMLParser class to extract the elements you're interested in. Something like this will work:
from HTMLParser import HTMLParser
import urllib

class AnchorParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        # Print the href attribute of every <a> tag in the page
        if tag == 'a':
            for key, value in attrs:
                if key == 'href':
                    print value

parser = AnchorParser()
data = urllib.urlopen('http://somewhere').read()
parser.feed(data)
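If you want the files themselves rather than just the printed links, here is a minimal sketch that collects the hrefs into a list and downloads each one with urllib.urlretrieve. The base URL is the same placeholder as above, and the filtering of parent-directory and sort links is an assumption about what a typical directory listing contains:

from HTMLParser import HTMLParser
import urllib
import urlparse
import os

class LinkCollector(HTMLParser):
    # Collect the href of every anchor tag into self.links
    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for key, value in attrs:
                if key == 'href':
                    self.links.append(value)

# Placeholder listing URL, as in the snippet above
base_url = 'http://somewhere/dir/'
collector = LinkCollector()
collector.feed(urllib.urlopen(base_url).read())

for href in collector.links:
    # Skip parent-directory and column-sorting links that listings usually include
    if href.startswith('..') or href.startswith('?') or href.startswith('/'):
        continue
    # Resolve relative links against the listing URL and download the file
    urllib.urlretrieve(urlparse.urljoin(base_url, href), os.path.basename(href))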
Robert Christie
2009-11-09 08:29:24
That does the trick indeed. Thanks for the suggestion!
tomlog
2009-11-09 09:15:34
A:
Why don't you use curl or wget to recursively download the given page, limited to one level of recursion? You will save yourself all the trouble of writing the script.
e.g. something like
wget -H -r --level=1 -k -p www.yourpage/dir
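If you would still rather drive this from Python, a minimal sketch is to shell out to the same wget command with subprocess and then walk the downloaded directory. This assumes wget is installed and on the PATH, and the URL is the placeholder from the command above:

import subprocess
import os

# A sketch only: invoke wget with the same flags as the command above
url = 'http://www.yourpage/dir'   # placeholder URL from the answer
subprocess.check_call(['wget', '-H', '-r', '--level=1', '-k', '-p', url])

# By default wget saves the mirror into a directory named after the host
for root, dirs, files in os.walk('www.yourpage'):
    for name in files:
        print os.path.join(root, name)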
Anurag Uniyal
2009-11-09 08:35:38
I want to use the retrieved files in my Python code, so it's easier for me to script it.
tomlog
2009-11-09 08:52:47