Hi, I want to accept HTML input from users and post it on my site, and I also want to make sure that dirty HTML code doesn't cause problems with my site template.

I used HTML Purifier in the past, but it is not working on one of my servers, so I am looking for the best alternative that is written purely in PHP and can fix dirty HTML code like this:

</div> (this is dirty code because the div is closed without ever being opened)
+3  A: 

You can try PHP Tidy, which is the Tidy library in PHP.
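
A rough sketch of what that might look like, assuming the tidy extension is actually loaded on the server. The function name repairFragment and the config choices are just illustrative, not anything from the Tidy docs beyond the tidy_repair_string call itself:

```php
<?php
// Sketch only: relies on the tidy extension (libtidy) being available.
// repairFragment() is a hypothetical helper name used for illustration.
function repairFragment(string $dirty): ?string
{
    if (!function_exists('tidy_repair_string')) {
        return null; // extension not loaded on this server
    }
    $config = [
        'show-body-only' => true, // return just the fragment, not a whole document
        'output-xhtml'   => true, // emit well-formed, closed tags
    ];
    $clean = tidy_repair_string($dirty, $config, 'utf8');
    return $clean === false ? null : $clean;
}

// A stray closing div and an unclosed <b>, as in the question.
var_dump(repairFragment('<p>Hello <b>world</p></div>'));
```

Note that this only repairs the markup; it does not strip unwanted tags or attributes.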

Vivin Paliath
I imagine it should. Looking at the installation page, it says that this module comes bundled with PHP >=5.
Vivin Paliath
A: 

I believe Tidy will help close your tags, but it isn't as comprehensive as HTML Purifier, which can also remove valid but unwanted tags or attributes (e.g. JavaScript onclick events, that kind of thing).

Be aware that Tidy requires libtidy to be installed on your server, so it's not just straight PHP.

I know Pádraic Brady has been working on an alternative to HTML Purifier for Zend Framework, though I think it's just experimental code at this time:

http://framework.zend.com/wiki/pages/viewpage.action?pageId=25002168

http://github.com/padraic/wibble

simonrjones
I tried it, but it has a lot of bugs.
Vivek Goel
Shame. I'd recommend either trying to get HTML Purifier working or trying Tidy.
simonrjones
+1  A: 

Simple solution without third-party libraries: create a DOMDocument and call loadHTML on it with your input. Surround the input with <html> and <body> tags if you are only parsing a little snippet. You'll probably want to suppress warnings too, as you'll get them spat out for common bad HTML.
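
A minimal sketch of the loading step. The helper name loadSnippet is made up for illustration, and the meta charset hint is an assumption on my part (a common workaround so the parser treats the input as UTF-8), not something required by the approach:

```php
<?php
// loadSnippet() is a hypothetical helper: wraps a fragment and parses it leniently.
function loadSnippet(string $snippet): DOMDocument
{
    // Collect libxml parse warnings internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML(
        '<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body>'
        . $snippet
        . '</body></html>'
    );

    libxml_clear_errors();                 // discard the collected warnings
    libxml_use_internal_errors($previous); // restore the earlier setting
    return $doc;
}

$doc  = loadSnippet('<p>Hello</div> <b>world');
$body = $doc->getElementsByTagName('body')->item(0);
echo $doc->saveHTML($body), "\n";
```

The stray closing tag is dropped during parsing, so it never makes it into the tree.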

Then simply walk over the resulting document tree, removing any elements and attributes you've not included in a known-good list. You should also check allowed URL attributes to ensure they use known-good schemes like http:, and not potentially troublesome schemes like javascript:. If you want to go the extra mile you can check that only allowed combinations of elements are nested inside each other (this is easier the fewer elements you allow).
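
The walk could be sketched roughly like this. The allow-lists and the names filterNode and isSafeUrl are hypothetical and only there to show the shape; this version throws away disallowed elements together with their contents, and it does not attempt the nesting check:

```php
<?php
// Hypothetical allow-lists: tag name => permitted attribute names.
const ALLOWED_TAGS = [
    'p' => [], 'b' => [], 'i' => [], 'em' => [], 'strong' => [],
    'ul' => [], 'ol' => [], 'li' => [], 'br' => [],
    'a' => ['href', 'title'],
];
const ALLOWED_SCHEMES = ['http', 'https', 'mailto'];
const URL_ATTRIBUTES  = ['href', 'src'];

function isSafeUrl(string $url): bool
{
    // Strip whitespace and control characters, which can be used to disguise a scheme.
    $url = preg_replace('/[\x00-\x20]+/', '', $url);
    if (!preg_match('/^([a-z][a-z0-9+.\-]*):/i', $url, $m)) {
        return true; // no scheme at all: treat as a relative URL
    }
    return in_array(strtolower($m[1]), ALLOWED_SCHEMES, true);
}

function filterNode(DOMNode $node): void
{
    // Copy the child list first, because the tree is modified during the walk.
    foreach (iterator_to_array($node->childNodes, false) as $child) {
        if ($child instanceof DOMText) {
            continue; // plain text is kept
        }
        $tag = strtolower($child->nodeName);
        if (!($child instanceof DOMElement) || !array_key_exists($tag, ALLOWED_TAGS)) {
            $node->removeChild($child); // comments, unknown elements and their contents
            continue;
        }
        foreach (iterator_to_array($child->attributes, false) as $attr) {
            $name = strtolower($attr->nodeName);
            $keep = in_array($name, ALLOWED_TAGS[$tag], true)
                && (!in_array($name, URL_ATTRIBUTES, true) || isSafeUrl($attr->nodeValue));
            if (!$keep) {
                $child->removeAttributeNode($attr);
            }
        }
        filterNode($child); // recurse into the allowed element
    }
}

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML('<html><body><p onclick="evil()">Hi <a href="javascript:alert(1)">link</a>'
    . '<script>alert(2)</script></p></body></html>');
libxml_clear_errors();

$body = $doc->getElementsByTagName('body')->item(0);
filterNode($body);
echo $doc->saveHTML($body), "\n";
```

Because the attribute values are read from the parsed DOM, entity-encoded tricks in the source are already decoded by the time the scheme is checked.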

Finally, serialise the snippet's node again using saveHTML. Because you're creating new markup from a DOM, not maintaining the original—potentially malformed—markup, that's a whole class of odd-markup injection techniques you're blocking.
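
For the serialising step, something like the following should do, assuming PHP 5.3.6 or later (where saveHTML accepts a node argument). The helper name innerHtml is just illustrative; it writes out the children of the wrapper <body> so the wrapper tags themselves don't end up in the output:

```php
<?php
// innerHtml() is a hypothetical helper: serialises a node's children, not the node itself.
function innerHtml(DOMNode $container): string
{
    $html = '';
    foreach ($container->childNodes as $child) {
        $html .= $container->ownerDocument->saveHTML($child);
    }
    return $html;
}

libxml_use_internal_errors(true); // keep parse warnings quiet for the demo
$doc = new DOMDocument();
$doc->loadHTML('<html><body><p>Unclosed <b>bold</div></body></html>');
libxml_clear_errors();

// The markup is regenerated from the tree, so tags come out balanced.
echo innerHtml($doc->getElementsByTagName('body')->item(0)), "\n";
```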

bobince