How to parse not strict HTML documents indulgently?

tags:

html
parsing

views:

135

+1 Q:

Hello again,

I've got one more question today.
Are there any HTML parsers with non-strict syntax analyzers available?
As far as I can see, such analyzers are built into web browsers.
It would be very nice to have a parser that processes the input document leniently, allowing any of the following situations that are invalid in XHTML and XML:

Single tags that are not self-closed, for example <br> or <hr>
Mismatched case in tag pairs: <td>...</TD>
Attribute values without quotation marks: <span class=hilite>...</SPAN>
And so on.

Please suggest a suitable parser.
Thank you.

+1 A:

If you're happy with Python, Beautiful Soup is just such a parser.

"You didn't write that awful page. You're just trying to get some data out of it. Right now, you don't really care what HTML is supposed to look like. Neither does this parser."
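
For a feel of what lenient parsing looks like in Python, here is a minimal sketch that uses only the standard library's html.parser module, which Beautiful Soup can also use as a backend. It is not Beautiful Soup's own API. The class name LenientCollector, the helper parse_leniently and the SAMPLE string are made up for illustration; the sample covers the three cases from the question.

```python
from html.parser import HTMLParser


class LenientCollector(HTMLParser):
    """Records start tags, end tags and text without validating the markup."""

    def __init__(self):
        super().__init__()
        # (kind, name_or_text, attrs) tuples in document order
        self.events = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lower-cased; unquoted values are accepted.
        self.events.append(("start", tag, dict(attrs)))

    def handle_endtag(self, tag):
        # End tag names are lower-cased too, so </TD> is reported as "td".
        self.events.append(("end", tag, {}))

    def handle_data(self, data):
        # Skip whitespace-only text between tags.
        if data.strip():
            self.events.append(("text", data.strip(), {}))


def parse_leniently(markup):
    """Hypothetical helper: feed the markup and return the recorded events."""
    parser = LenientCollector()
    parser.feed(markup)
    parser.close()
    return parser.events


# Hypothetical sample: mismatched case, an unclosed <br>, an unquoted attribute.
SAMPLE = "<td>cell</TD><br><span class=hilite>marked</SPAN>"

if __name__ == "__main__":
    for event in parse_leniently(SAMPLE):
        print(event)
```

The parser reports the tags as it finds them and does not complain about the missing slash in <br>, the upper-case closing tags, or the unquoted class value. It does not build a tree or repair nesting, which is the part Beautiful Soup adds on top.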

RichieHindle 2009-09-24 17:59:24

Thank you. It would be a great reason for me to learn Python :)

Lyubomyr Shaydariv 2009-09-24 18:03:50

+2 A:

TagSoup is available for various languages, including Java, C++ (Taggle) and XSLT (TSaxon).

...TagSoup, a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short. TagSoup is designed for people who have to process this stuff using some semblance of a rational application design. By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML. TagSoup also includes a command-line processor that reads HTML files and can generate either clean HTML or well-formed XML that is a close approximation to XHTML.

Rich Seller 2009-09-24 18:00:43

wow! it's really promising! thank you :)

Lyubomyr Shaydariv 2009-09-24 18:06:13

+1 A:

Hpricot is particularly good at parsing broken markup if you're not afraid of a bit of Ruby. http://github.com/whymirror/hpricot

hgimenez 2009-09-24 18:43:39

Oh great, thank you! :) So many answers! :)

Lyubomyr Shaydariv 2009-09-24 20:41:58
