I'm looking for best practices for performing strict (whitelist) validation/filtering of user-submitted HTML.
Main purpose is to filter out XSS and similar nasties that may be entered via web forms. Secondary purpose is to limit breakage of HTML content entered by non-technical users e.g. via WYSIWYG editor that has an HTML view.
I'm considering using HTML Purifier, or rolling my own by using an HTML DOM parser to go through a process like HTML(dirty)->DOM(dirty)->filter->DOM(clean)->HTML(clean).
Can you describe successes with these or any easier strategies that are also effective? Any pitfalls to watch out for?