tags:

views: 5111

answers: 8

Using the Python documentation I found the HTML parser, but I have no idea which library to import in order to use it. How do I find this out (bearing in mind it doesn't say on the page)?

+3  A: 

Try:

import HTMLParser
Koh Wei Jie
+1  A: 

There's a link to an example at the bottom of the page.

Vytautas Shaltenis
+12  A: 

You probably really want BeautifulSoup; check the link for an example.

But in any case

>>> import HTMLParser
>>> h = HTMLParser.HTMLParser()
>>> h.feed('<html></html>')
>>> h.get_starttag_text()
'<html>'
>>> h.close()
Vinko Vrsalovic
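
The module is normally used by subclassing it and overriding the handler methods rather than calling it directly as above. A minimal sketch, assuming Python 2 as in the session above (Python 3 later renamed the module to html.parser); LinkExtractor is just an illustrative name:

import HTMLParser

class LinkExtractor(HTMLParser.HTMLParser):
    # collect the href of every <a> tag seen while feeding
    def __init__(self):
        HTMLParser.HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

parser = LinkExtractor()
parser.feed('<html><body><a href="http://example.com">x</a></body></html>')
print parser.links          # ['http://example.com']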
+2  A: 

I would recommend using the Beautiful Soup module instead; it has good documentation.

Swaroop C H
I'm gonna give it a whirl, thanks for the suggestion
Teifion
+1  A: 

For real-world HTML processing I'd recommend BeautifulSoup. It is great and takes away much of the pain. Installation is easy.

Antti Rasinen
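
For reference, a minimal sketch of the BeautifulSoup usage these answers have in mind, assuming the BeautifulSoup 3 release of the time (the later bs4 package changes the import and spells findAll as find_all):

from BeautifulSoup import BeautifulSoup    # with bs4: from bs4 import BeautifulSoup

html = '<html><body><a href="/one">one</a> <a href="/two">two</a></body></html>'
soup = BeautifulSoup(html)
for anchor in soup.findAll('a'):           # bs4 spells this find_all
    print anchor['href']                   # /one, then /two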
+1  A: 

You should also look at html5lib for Python, as it tries to parse HTML in a way that closely resembles what web browsers do, especially when dealing with invalid HTML (which is more than 90% of today's web).

Alexey Feldgendler
Please add a link to html5lib. Thank you.
Cristian Ciupitu
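
A minimal sketch of the usual html5lib call, treating the details as an assumption since the answer gives no example; by default it builds an xml.etree tree with tags in the XHTML namespace:

import html5lib

# html5lib repairs the markup the way a browser would, filling in the
# missing <head>, <body> and </b> for us
document = html5lib.parse('<p>bad <b>markup')
for element in document.iter():
    print element.tag      # e.g. {http://www.w3.org/1999/xhtml}p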
+2  A: 

I don't recommend BeautifulSoup if you want speed. lxml is much, much faster, and you can fall back on lxml's BeautifulSoup-based soupparser if the default parser doesn't work.

Koh Wei Jie
I agree. BeautifulSoup is only useful when parsing a handful of files; there are too many memory leaks.
DrDee
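
A sketch of the fallback pattern the answer describes, assuming lxml.html plus its BeautifulSoup-backed soupparser module (which requires BeautifulSoup to be installed):

from lxml import html
from lxml.html import soupparser

tag_soup = '<p>some <b>really broken<p>markup'
try:
    doc = html.fromstring(tag_soup)        # lxml's fast default parser
except Exception:
    doc = soupparser.fromstring(tag_soup)  # slower, BeautifulSoup-based fallback
print html.tostring(doc)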
+2  A: 

You may be interested in lxml. It is a separate package with C components, but it is the fastest. It also has a very nice API that lets you easily list the links in an HTML document, list its forms, sanitize HTML, and more. It can also parse HTML that is not well-formed (this is configurable). A sketch follows below.

phjr
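
A minimal sketch of the lxml.html calls the answer alludes to, under the assumption of a reasonably recent lxml release (the sanitizing part is covered by lxml.html.clean):

from lxml import html

page = '''<html><body>
  <a href="/about">about</a>
  <form action="/search"><input name="q"></form>
</body></html>'''

doc = html.fromstring(page)
print doc.xpath('//a/@href')                 # ['/about']   -- all link targets
print [form.action for form in doc.forms]    # ['/search']  -- all forms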