For URLs that show file trees, such as PyPI packages, is there a small, solid module to walk the URL tree and list it like ls -lR?
I gather (correct me) that there's no standard encoding of file attributes (link types, size, date ...) in HTML <a> attributes, so building a solid URLtree module on such shifting sands is tough.
But surely this wheel (Unix file tree -> html -> treewalk API -> ls -lR or find) has been built before?
(There seem to be several spiders / web crawlers / scrapers out there, but they look ugly and ad hoc so far, despite BeautifulSoup for parsing).
Apache servers are very common, and they have a relatively standard way of listing file directories.
Here's a simple script that does roughly what you want; you should be able to adapt it from there.
Usage: python list_apache_dir.py <url> [<url> ...]
import sys
import urllib
import re

parse_re = re.compile(r'href="([^"]*)".*(..-...-.... ..:..).*?(\d+[^\s<]*|-)')
    # look for a link + a timestamp + a size ('-' for dir)

def list_apache_dir(url):
    # print one Apache-style index page, then recurse into its subdirectories
    try:
        html = urllib.urlopen(url).read()
    except IOError, e:
        print 'error fetching %s: %s' % (url, e)
        return
    if not url.endswith('/'):
        url += '/'
    files = parse_re.findall(html)
    dirs = []
    print url + ' :'
    print '%4d file' % len(files) + 's' * (len(files) != 1)
    for name, date, size in files:
        if size.strip() == '-':
            size = 'dir'
        if name.endswith('/'):
            dirs += [name]   # remember subdirectories for the recursive pass
        print '%5s %s %s' % (size, date, name)

    for dir in dirs:
        print
        list_apache_dir(url + dir)

for url in sys.argv[1:]:
    print
    list_apache_dir(url)
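The script above is Python 2 (urllib.urlopen, print statements). A rough Python 3 sketch of the same regex approach, standard library only and not tested against a live server, would be:

import re
import sys
import urllib.request

# same idea: a link, a timestamp, and a size ('-' marks a directory)
parse_re = re.compile(r'href="([^"]*)".*(..-...-.... ..:..).*?(\d+[^\s<]*|-)')

def list_apache_dir(url):
    try:
        html = urllib.request.urlopen(url).read().decode('utf-8', 'replace')
    except OSError as e:                    # URLError is a subclass of OSError
        print('error fetching %s: %s' % (url, e))
        return
    if not url.endswith('/'):
        url += '/'
    entries = parse_re.findall(html)
    print(url, ':')
    for name, date, size in entries:
        size = 'dir' if size.strip() == '-' else size
        print('%5s  %s  %s' % (size, date, name))
    for name, date, size in entries:
        if name.endswith('/'):              # recurse into subdirectories
            list_apache_dir(url + name)

for url in sys.argv[1:]:
    list_apache_dir(url)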
Turns out that BeautifulSoup one-liners like these can turn <table> rows into Python --
from BeautifulSoup import BeautifulSoup

def trow_cols( trow ):
    """ one <tr> from soup.table( "tr" ) -> <td> strings like
        [None, u'Achoo-1.0-py2.5.egg', u'11-Aug-2008 07:40 ', u'8.9K']
    """
    return [td.next.string for td in trow( "td" )]

def trow_headers( trow ):
    """ one <tr> from soup.table( "tr" ) -> <th> header strings like
        [None, u'Name', u'Last modified', u'Size', u'Description']
    """
    return [th.next.string for th in trow( "th" )]

if __name__ == "__main__":
    ...
    soup = BeautifulSoup( html )
    if soup.table:
        trows = soup.table( "tr" )
        print "headers:", trow_headers( trows[0] )
        for row in trows[1:]:
            print trow_cols( row )
Compared to sysrqb's one-line regexp above, this is ... longer; who said
"You can parse some of the html all of the time, or all of the html some of the time, but not ..."
Others have recommended BeautifulSoup, but it's much better to use lxml. Despite its name, it is also for parsing and scraping HTML. It's much, much faster than BeautifulSoup. It has a compatibility API for BeautifulSoup too if you don't want to learn the lxml API.
There's no reason to use BeautifulSoup anymore, unless you're on Google App Engine or something where anything not purely Python isn't allowed.
It has CSS selectors as well, so this sort of thing is trivial.
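For example, a minimal sketch with lxml.html and CSS selectors (assumes the cssselect package is installed and that the index page lists files in a table):

import sys
import lxml.html                        # pip install lxml cssselect

doc = lxml.html.parse(sys.argv[1]).getroot()

# every link inside a data row of the listing table
for link in doc.cssselect('table tr td a'):
    print(link.get('href'), link.text_content())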