I currently have a spider written in Java (using HtmlUnit) that logs into a supplier website and crawls it.

It keeps the session (cookies) and even lets me enable or disable JavaScript.

I also use htmlparser (Java) to help parse the HTML and extract the relevant information.

Does Python have something similar?

+4  A: 

Python has urllib2 for fetching pages, which supports password authentication and, together with cookielib, cookies.

There is also HTMLParser for parsing HTML, but some people prefer the more full-featured BeautifulSoup.
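
As a rough sketch of how those pieces fit together (not a drop-in replacement for the HtmlUnit spider): the example below uses the Python 3 names, where urllib2 became `urllib.request`, cookielib became `http.cookiejar`, and HTMLParser lives in `html.parser`. The `login_and_list_links` helper, its URLs, and the `username`/`password` form field names are assumptions for illustration; a real site will have its own login form. Note that nothing here executes JavaScript.

```python
# Python 3 names; in Python 2 these were urllib2, cookielib and HTMLParser.
from html.parser import HTMLParser
from http.cookiejar import CookieJar
from urllib.parse import urlencode
from urllib.request import HTTPCookieProcessor, build_opener


def make_session():
    """Return an opener that stores cookies and sends them back on later requests."""
    jar = CookieJar()
    return build_opener(HTTPCookieProcessor(jar)), jar


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names before calling this.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html):
    """Parse an HTML string and return the list of link targets found."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


def login_and_list_links(login_url, page_url, username, password):
    """Hypothetical flow: POST a login form, then fetch a page with the same cookies."""
    opener, _jar = make_session()
    # Field names are assumed; check the real form on the target site.
    form = urlencode({"username": username, "password": password}).encode("utf-8")
    opener.open(login_url, data=form).close()  # any session cookie lands in the jar
    with opener.open(page_url) as response:
        return extract_links(response.read().decode("utf-8", errors="replace"))
```

The parsing half can be tried without a network connection, for example `extract_links('<a href="/a">A</a>')`.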

Stephen
Very cool, I'm really getting excited by all things Python!
Blankman
What's _really_ cool is that it'll be about one-millionth of the amount of Java code you had to write ;)
Stephen
Indeed, that is exactly what I meant.
Blankman
+1 for mentioning `BeautifulSoup`
Aviral Dasgupta
Before using BeautifulSoup, check out lxml. It's a **much** better and faster general parser. BeautifulSoup is good for pocket cases and munged HTML, and it can also be used through the lxml API. If you go the BS route :) get version 3.0; version 3.1 is absolute junk.
ebt
A: 

Scrapy is a full crawling framework: it handles the fetching itself (on top of Twisted rather than urllib2) and wires up several parsers and helper routines.

ebt