tags:

views:

39

answers:

2

How would it be possible to do the following:

  1. Scan through an html page (preferably through a whole domain (www.python.org) and extract all

h1 h2 ...hn Tags

and write all Headings to a file. In the correct order:

Start with h1 Than h2

until we reach the next h1

+2  A: 

Use BeautifulSoup or PyQuery.

gruszczy
+1  A: 

Given the requirement to scan a whole website, you might want to look into pycurl to grab the files to scrape. Be careful not to hit the site with the equivalent of a DoS attack though.

mavnn
Thanks to both of you. Nice, very interesting! How can I avoid being seen as a potential threat?
MacPython
Just limit how hard and often you hit the site. Of course, if you control the site just pick a quiet moment and hammer it for a while.There aren't any hard or fast guidelines: what a site can cope with/is considered acceptable will vary enormously.
mavnn