Simple solution without third-party libraries: create a DOMDocument and call loadHTML on it with your input. Surround the input with <html> and <body> tags if you are only parsing a small snippet. You'll probably want to suppress warnings too, as common bad HTML will otherwise cause them to be emitted.
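A minimal sketch of that parsing step follows. The function name parse_snippet is hypothetical, and the meta tag in the wrapper is my own assumption, added so that loadHTML treats the input as UTF-8. Warnings are suppressed with libxml_use_internal_errors rather than the @ operator.

```php
<?php
// Hypothetical helper: parse an HTML snippet into a DOMDocument.
function parse_snippet(string $snippet): DOMDocument
{
    $doc = new DOMDocument();

    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);

    // Wrap the snippet so the parser sees a complete document.
    // Assumption: the meta tag is only there to hint at UTF-8 input.
    $doc->loadHTML(
        '<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body>'
        . $snippet
        . '</body></html>'
    );

    // Discard the collected warnings and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $doc;
}
```

The snippet's content then lives under the document's body element, which is the node to walk in the next step.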
Then simply walk over the resulting document tree, removing any elements and attributes you've not included in a known-good list. You should also check allowed URL attributes to ensure they use known-good schemes like http:, and not potentially troublesome schemes like javascript:. If you want to go the extra mile you can check that only allowed combinations of elements are nested inside each other (this is easier the fewer elements you allow).
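The tree walk could be sketched roughly like this. The allow-lists, the function names sanitize_children and is_safe_url, and the decision to accept scheme-less (relative) URLs are all illustrative assumptions; pick lists that suit your own use case. Disallowed elements are dropped together with their content here, and nesting rules are not checked.

```php
<?php
// Illustrative allow-lists (assumptions): element name => allowed attributes.
const ALLOWED_ELEMENTS = [
    'p' => [], 'b' => [], 'i' => [], 'em' => [], 'strong' => [],
    'ul' => [], 'ol' => [], 'li' => [], 'br' => [],
    'a' => ['href', 'title'],
];
const URL_ATTRIBUTES = ['href', 'src'];
const ALLOWED_SCHEMES = ['http', 'https', 'mailto'];

// Returns true if the URL has no scheme or uses an allowed scheme.
function is_safe_url(string $url): bool
{
    // Remove whitespace and control characters that could hide a scheme.
    $url = preg_replace('/[\x00-\x20]+/', '', $url) ?? '';
    if (!preg_match('/^([a-z][a-z0-9+.\-]*):/i', $url, $m)) {
        return true; // Assumption: scheme-less (relative) URLs are acceptable.
    }
    return in_array(strtolower($m[1]), ALLOWED_SCHEMES, true);
}

// Recursively removes everything under $parent that is not on the lists.
function sanitize_children(DOMNode $parent): void
{
    // Copy the live node list first, since the tree changes while we iterate.
    foreach (iterator_to_array($parent->childNodes, false) as $child) {
        if ($child->nodeType === XML_TEXT_NODE) {
            continue; // Plain text is kept; it is escaped on serialisation.
        }
        if (!($child instanceof DOMElement)) {
            $parent->removeChild($child); // Comments, CDATA, PIs and so on.
            continue;
        }
        $name = strtolower($child->nodeName);
        if (!array_key_exists($name, ALLOWED_ELEMENTS)) {
            $parent->removeChild($child); // Unknown element: drop with content.
            continue;
        }
        foreach (iterator_to_array($child->attributes, false) as $attr) {
            $attrName = strtolower($attr->nodeName);
            $allowed = in_array($attrName, ALLOWED_ELEMENTS[$name], true);
            $isUrl = in_array($attrName, URL_ATTRIBUTES, true);
            if (!$allowed || ($isUrl && !is_safe_url($attr->value))) {
                $child->removeAttributeNode($attr);
            }
        }
        sanitize_children($child);
    }
}
```

Call sanitize_children on the body element of the parsed document. Dropping a disallowed element with its content is the conservative choice; you could instead move its children up into the parent if you want to keep the text.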
Finally, serialise the snippet's node again using saveHTML. Because you're creating new markup from a DOM, not preserving the original (potentially malformed) markup, you block a whole class of odd-markup injection techniques.
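For the serialisation step, something along these lines should work. The function name serialize_body is hypothetical, and it assumes the snippet was wrapped in a body element when it was parsed. Passing a node to saveHTML outputs only that node and its descendants, so the wrapper tags are left out.

```php
<?php
// Hypothetical helper: serialise the children of <body> back to a string.
function serialize_body(DOMDocument $doc): string
{
    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return ''; // Nothing was parsed into a body element.
    }

    $html = '';
    foreach ($body->childNodes as $child) {
        // Each child is written out as fresh markup generated from the DOM.
        $html .= $doc->saveHTML($child);
    }
    return $html;
}
```

Unclosed tags in the input come out closed, and special characters in text come out escaped, because the output is generated from the tree rather than copied from the input.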