Is there a way to remove/escape HTML tags using lxml.html rather than BeautifulSoup, which has some XSS issues? I tried using Cleaner, but I want to remove all HTML.
A:
Try the .text_content() method on an element, probably best after using lxml.html.clean to get rid of unwanted content (script tags, etc.). For example:
from lxml import html
from lxml.html.clean import clean_html

# Parse the page, strip scripts and other unwanted elements,
# then extract only the remaining text.
tree = html.parse('http://www.example.com')
tree = clean_html(tree)
text = tree.getroot().text_content()
Steven
2010-10-20 08:23:56
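As an aside, if lxml is not available, a rough version of the same tag stripping can be sketched with only the standard library's html.parser. This is a minimal sketch (the TextExtractor and strip_tags names are mine), not a real sanitizer; like the lxml approach, it skips the contents of script and style elements:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects character data only, dropping all tags."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside script/style.
        if not self._skip:
            self.parts.append(data)


def strip_tags(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


print(strip_tags("<p>Hello <b>world</b><script>var x;</script></p>"))
# → Hello world
```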
I want to get rid of everything, not just unsafe tags
Timmy
2010-10-20 13:26:12
If you want to get rid of everything, why not just `text=''`? ;-) Seriously, `text_content()` WILL get rid of all markup, but cleaning will also get rid of e.g. CSS stylesheet rules and JavaScript, which are also encoded as text *inside* the element (but I assumed you were only interested in the "real" text, hence the cleanup first).
Steven
2010-10-20 14:09:57
I was using clean_html(string), which does different things.
Timmy
2010-10-20 20:18:26