views: 150
answers: 2

If I have a directory on a remote web server that allows directory browsing, how would I go about fetching all the files listed there from my other web server? I know I can use urllib2.urlopen to fetch individual files, but how do I get a list of all the files in that remote directory?

+3  A: 

If the web server has directory browsing enabled, it will return an HTML document containing links to all the files. You can parse that document and extract the links, which gives you the list of files.

You can use the HTMLParser class to extract the elements you're interested in. Something like this will work:

from HTMLParser import HTMLParser
import urllib

class AnchorParser(HTMLParser):
    # Print the href of every <a> tag the parser encounters.
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for key, value in attrs:
                if key == 'href':
                    print value

parser = AnchorParser()   # instantiate with no arguments
data = urllib.urlopen('http://somewhere').read()
parser.feed(data)
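
To go one step further and actually download everything, you could collect the hrefs in a list instead of printing them and then fetch each one with urllib2.urlopen, as you mentioned. A rough sketch along those lines (the base URL is just a placeholder, and the skip rules assume a typical Apache-style listing):

from HTMLParser import HTMLParser
import urllib2
import urlparse
import os

class LinkCollector(HTMLParser):
    # Collect every href into a list instead of printing it.
    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for key, value in attrs:
                if key == 'href':
                    self.links.append(value)

# Placeholder URL -- substitute the real directory listing here.
base_url = 'http://example.com/files/'

collector = LinkCollector()
collector.feed(urllib2.urlopen(base_url).read())

for link in collector.links:
    # Skip Apache-style sort links ("?C=N;O=D") and sub-directory entries.
    if link.startswith('?') or link.endswith('/'):
        continue
    file_url = urlparse.urljoin(base_url, link)
    # Save each file under its own name in the current directory.
    out = open(os.path.basename(link), 'wb')
    out.write(urllib2.urlopen(file_url).read())
    out.close()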
Robert Christie
That does the trick indeed. Thanks for the suggestion!
tomlog
A: 

Why don't you use curl or wget to recursively download the given page, limited to one level of depth? That would save you all the trouble of writing the script.

e.g. something like

wget -H -r --level=1 -k -p www.yourpage/dir
Anurag Uniyal
I want to use the retrieved files in my Python code, so it's easier for me to script it.
tomlog
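
If the wget route ever becomes more convenient, it can still be driven from a Python script via the standard subprocess module. A minimal sketch (the URL is a placeholder and wget must be available on the PATH):

import subprocess

# Placeholder URL -- point this at the real directory listing.
url = 'http://example.com/dir/'

subprocess.check_call([
    'wget',
    '-r',                # recurse into the listing
    '--level=1',         # but only one level deep
    '--no-parent',       # never ascend above the starting directory
    '--no-directories',  # drop all files into the current directory
    url,
])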