I have a huge database of scraped forum posts that I am inserting into a website. However, a lot of people try to use HTML in their forum posts and often get it wrong. Because of this, there are always stray <strike>, <b>, </strike>, </div>, </b> tags in the posts, which end up breaking the page layout once I add, say, 15 forum posts.

For now I have just been appending every possible end tag to each post so that it might close any open tag. Is there a better way to do this, short of parsing the text and trying to remove each unmatched tag manually? For very long forum posts that is an expensive operation for a web app.

+1  A: 

Have a look at HTML Tidy.

There is also a Python wrapper library: µTidylib.

Alternatively, there is HTML Purifier.

irishbuzz
utidylib doesn't seem to have been updated since 2004.
Simon Hibbs
A: 

Beautiful Soup does a decent job at HTML cleanup.
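
As a rough illustration of that kind of cleanup, here is a minimal sketch assuming the `bs4` package (Beautiful Soup 4) is installed. The helper name `close_stray_tags` and the sample post are made up for the example. Parsing a post and serializing it again should close any tags left open and drop end tags that never had an opening tag.

```python
from bs4 import BeautifulSoup


def close_stray_tags(post_html: str) -> str:
    """Parse one forum post and re-serialize it with balanced tags.

    Hypothetical helper for illustration, not part of Beautiful Soup.
    """
    # "html.parser" selects Python's built-in parser, so no extra
    # dependency is needed and no <html>/<body> wrapper is added.
    soup = BeautifulSoup(post_html, "html.parser")
    return str(soup)


# Sample post: <strike> is never closed and </div> has no opening tag.
post = "some <b>bold and <strike>struck</b> text </div> here"
cleaned = close_stray_tags(post)
print(cleaned)
```

Running this once per post when it is stored, rather than on every page view, would keep the parsing cost out of the request path.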

Simon Hibbs
A: 

Look at lxml also.

loevborg
Or both - http://codespeak.net/lxml/elementsoup.html
Simon Hibbs