Hi.

I need an HTML SAX (not DOM!) parser for PHP that can process even invalid HTML. The reason I need it is to filter user-entered HTML (remove all tags and attributes except the allowed ones) and to truncate the HTML content to a specified length.

Any ideas?

+1  A: 

SAX was designed to process well-formed XML and to fail on invalid markup. Processing invalid HTML markup requires keeping more state than SAX parsers typically keep.

I'm not aware of any SAX-like parser for HTML. Your best shot is to pass the HTML through tidy first and then use an XML parser, but this may defeat your purpose of using a SAX parser in the first place.
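
For what it's worth, a rough sketch of the tidy-then-XML-parser route might look like the code below. It assumes the tidy and xml extensions are loaded; `repairFragment()` and `filterTags()` are made-up names for illustration. The fragment is wrapped in a synthetic `<root>` element inside the function only so that the XML parser accepts it. Attributes are dropped entirely, and text inside disallowed tags (including `<script>`) is kept as plain text, which a real filter would probably want to handle differently.

```php
<?php
// Sketch only. repairFragment() and filterTags() are hypothetical helpers,
// not part of any library.

// Ask tidy to repair the fragment and emit well-formed XML for the body only.
function repairFragment(string $html): string
{
    $xml = tidy_repair_string($html, [
        'output-xml'       => true,
        'show-body-only'   => true,
        'numeric-entities' => true, // avoid named entities an XML parser rejects
        'wrap'             => 0,    // do not re-wrap long lines
    ], 'utf8');

    return $xml === false ? '' : $xml;
}

// Stream a well-formed fragment through PHP's event-driven XML parser,
// re-emitting only the allowed tags (without attributes) and all text.
function filterTags(string $xml, array $allowedTags): string
{
    $out = '';
    $parser = xml_parser_create('UTF-8');
    // Keep element names as written instead of upper-casing them.
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, false);

    xml_set_element_handler(
        $parser,
        function ($p, string $name, array $attrs) use (&$out, $allowedTags) {
            if (in_array($name, $allowedTags, true)) {
                $out .= "<$name>"; // attributes are intentionally discarded
            }
        },
        function ($p, string $name) use (&$out, $allowedTags) {
            if (in_array($name, $allowedTags, true)) {
                $out .= "</$name>";
            }
        }
    );
    xml_set_character_data_handler(
        $parser,
        function ($p, string $data) use (&$out) {
            // The parser hands over decoded text, so escape it again.
            $out .= htmlspecialchars($data, ENT_NOQUOTES, 'UTF-8');
        }
    );

    // Synthetic root: a bare fragment is not a well-formed XML document.
    $ok = xml_parse($parser, "<root>$xml</root>", true);

    return $ok === 1 ? $out : '';
}
```

Usage would be something like `filterTags(repairFragment($input), ['b', 'i'])`. Truncating to a given length could be done in the same handlers by counting characters in the text callback and closing any open tags once the limit is reached.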

Artefacto
Even after tidy, the pieces of HTML won't be valid. They look like this: `some comment with <b>bold text</b>, <i>italic text</i>.` That's an invalid document for any XML parser: there's no root, and I don't want to mess around with wrapping the content in some root element.
Daniel
@Daniel Why do you need an event handler in the first place? If the HTML snippets are short, I see no compelling reason.
Artefacto
What event handler? o_O
Daniel
@Daniel Sorry, I meant an event-driven API such as SAX.
Artefacto
Oh, I've already got an implementation using a SAX parser. It's very efficient and simple, but the problem is the SAX parser itself: it uses regexps to parse HTML :(
Daniel
@Daniel HTML parsing with regex => trouble
Artefacto
Agreed. That's why I'm looking for something better.
Daniel
A: 

Try the HTML SAX Parser.

murad
I've tried it. It can't handle embedded JS or complex styles because it's based on regexes.
Daniel
I used it to solve the same problem you are trying to solve. I filter user-generated content and strip JavaScript, tags, and attributes.
murad