views: 63
answers: 1

Is there a way to remove/escape HTML tags using lxml.html rather than BeautifulSoup, which has some XSS issues? I tried using Cleaner, but I want to remove all HTML.

+1  A: 

Try the .text_content() method on an element, probably best after using lxml.html.clean to get rid of unwanted content (script tags, etc.). For example:

from lxml import html
from lxml.html.clean import clean_html

# Parse the page, then strip scripts, styles and other
# unwanted elements before extracting the text.
tree = html.parse('http://www.example.com')
tree = clean_html(tree)

text = tree.getroot().text_content()
Steven
I want to get rid of everything, not just unsafe tags
Timmy
If you want to get rid of everything, why not just `text=''`? ;-) Seriously, `text_content()` WILL get rid of all markup, but cleaning will also get rid of e.g. CSS stylesheet rules and JavaScript, which are also encoded as text *inside* the element (but I assumed you were only interested in the "real" text, hence the cleanup first)
Steven
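A minimal sketch of that point, using a hypothetical HTML fragment: without cleaning first, `text_content()` drops every tag but keeps *all* text nodes, including the body of a script element.

```python
from lxml import html

# Hypothetical fragment: real text plus an inline script.
fragment = html.fromstring(
    '<div><p>Hello <b>world</b>!</p>'
    '<script>var x = 1;</script></div>'
)

# text_content() strips the markup but keeps the script's
# text node along with the "real" text.
text = fragment.text_content()
```

Here `text` contains both "Hello world!" and "var x = 1;", which is why Steven's answer runs `clean_html` before calling `text_content()`.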
I was using clean_html(string), which does different things
Timmy
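On that difference: when `clean_html` is given a string it returns a cleaned HTML *string*, with the safe markup still in place, so you still need to parse the result and call `text_content()` to get plain text. A sketch with a made-up fragment (note that in recent lxml releases the clean module lives in the separate `lxml_html_clean` package):

```python
from lxml import html
from lxml.html.clean import clean_html

# clean_html(string) returns a cleaned HTML string, not text:
# the <script> is removed but safe tags like <b> remain.
cleaned = clean_html('<p>Hi <script>x()</script><b>there</b></p>')

# To get plain text, parse the cleaned string and extract it.
text = html.fromstring(cleaned).text_content()
```

So the string form cleans but does not flatten; combining it with `text_content()` gives the same end result as the tree-based version in the answer.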