For URLs that show file trees, such as PyPI packages, is there a small, solid module to walk the URL tree and list it like ls -lR?
I gather (correct me) that there's no standard encoding of file attributes (link types, size, date ...) in HTML <a> attributes, so building a solid URLtree module on such shifting sands is tough.
But surely this wheel (Unix file tree -> html -> treewalk API -> ls -lR or find) has been built before?
(There seem to be several spiders / web crawlers / scrapers out there, but they look ugly and ad hoc so far, despite BeautifulSoup for parsing).
Apache servers are very common, and they have a relatively standard way of listing file directories.
Here's a simple script that does roughly what you want; you should be able to adapt it from there.
Usage: python list_apache_dir.py <url> [<url> ...]
import sys
import urllib
import re

parse_re = re.compile(r'href="([^"]*)".*(..-...-.... ..:..).*?(\d+[^\s<]*|-)')
    # look for a link + a timestamp + a size ('-' for dir)

def list_apache_dir(url):
    # print one Apache-style index page, then recurse into its subdirectories
    try:
        html = urllib.urlopen(url).read()
    except IOError, e:
        print 'error fetching %s: %s' % (url, e)
        return
    if not url.endswith('/'):
        url += '/'
    files = parse_re.findall(html)
    dirs = []
    print url + ' :'
    print '%4d file' % len(files) + 's' * (len(files) != 1)
    for name, date, size in files:
        if size.strip() == '-':
            size = 'dir'
        if name.endswith('/'):
            dirs += [name]   # remember subdirectories for the recursive pass
        print '%5s %s %s' % (size, date, name)

    for dir in dirs:
        print
        list_apache_dir(url + dir)

for url in sys.argv[1:]:
    print
    list_apache_dir(url)
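The script above is Python 2 (urllib.urlopen, print statements). A rough Python 3 sketch of the same regex approach, standard library only and not tested against a live server, would be:

import re
import sys
import urllib.request

# same idea: a link, a timestamp, and a size ('-' marks a directory)
parse_re = re.compile(r'href="([^"]*)".*(..-...-.... ..:..).*?(\d+[^\s<]*|-)')

def list_apache_dir(url):
    try:
        html = urllib.request.urlopen(url).read().decode('utf-8', 'replace')
    except OSError as e:                    # URLError is a subclass of OSError
        print('error fetching %s: %s' % (url, e))
        return
    if not url.endswith('/'):
        url += '/'
    entries = parse_re.findall(html)
    print(url, ':')
    for name, date, size in entries:
        size = 'dir' if size.strip() == '-' else size
        print('%5s  %s  %s' % (size, date, name))
    for name, date, size in entries:
        if name.endswith('/'):              # recurse into subdirectories
            list_apache_dir(url + name)

for url in sys.argv[1:]:
    list_apache_dir(url)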
Turns out that BeautifulSoup one-liners like these can turn <table> rows into Python --
from BeautifulSoup import BeautifulSoup

def trow_cols( trow ):
    """ one <tr> from soup.table( "tr" ) -> <td> strings like
        [None, u'Achoo-1.0-py2.5.egg', u'11-Aug-2008 07:40 ', u'8.9K']
    """
    return [td.next.string for td in trow( "td" )]

def trow_headers( trow ):
    """ one <tr> from soup.table( "tr" ) -> <th> header strings like
        [None, u'Name', u'Last modified', u'Size', u'Description']
    """
    return [th.next.string for th in trow( "th" )]

if __name__ == "__main__":
    ...
    soup = BeautifulSoup( html )
    if soup.table:
        trows = soup.table( "tr" )
        print "headers:", trow_headers( trows[0] )
        for row in trows[1:]:
            print trow_cols( row )
Compared to sysrqb's one-line regexp above, this is ... longer; who said
"You can parse some of the html all of the time, or all of the html some of the time, but not ..."
Others have recommended BeautifulSoup, but it's much better to use lxml. Despite its name, it is also for parsing and scraping HTML. It's much, much faster than BeautifulSoup. It has a compatibility API for BeautifulSoup too if you don't want to learn the lxml API.
There's no reason to use BeautifulSoup anymore, unless you're on Google App Engine or something where anything not purely Python isn't allowed.
It has CSS selectors as well, so this sort of thing is trivial.
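For example, a minimal sketch with lxml.html and CSS selectors (assumes the cssselect package is installed and that the index page lists files in a table):

import sys
import lxml.html                        # pip install lxml cssselect

doc = lxml.html.parse(sys.argv[1]).getroot()

# every link inside a data row of the listing table
for link in doc.cssselect('table tr td a'):
    print(link.get('href'), link.text_content())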