I am trying to remove comments from a list of elements that were obtained using lxml.

The best I have been able to do is:

no_comments=[element for element in element_list if 'HtmlComment' not in str(type(element))]

I am wondering if there is a more direct way?

I am going to add something based on Matthew's answer. He got me almost there. The problem is that once the elements are taken from the tree, the comments seem to lose some identity (I don't know how to describe it), so isinstance() can no longer determine whether they are HtmlComment objects.

However, that method can be used while the elements are being iterated through on the tree:

from lxml.html import HtmlComment
no_comments=[element for element in root.iter() if not isinstance(element,HtmlComment)]

For novices like me: root is the base html element that holds all of the other elements in the tree, and there are a number of ways to get it. One is to open the file and parse it with lxml.html (after from lxml import html), so instead of root.iter() in the above you can write:

html.fromstring(open(r'c:\temp\testlxml.htm').read()).iter()
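
For anyone who hits the same isinstance() problem on a list that was already built, here is a minimal sketch of another check that may help. It assumes the elements came from lxml, and it relies on lxml's documented behaviour that a comment node's tag is the etree.Comment factory function rather than a string. The sample markup and the variable names are made up for illustration.

```python
from lxml import etree, html

# Hypothetical sample markup: one comment sitting between ordinary elements.
doc = html.fromstring(
    '<div><p>first</p><!-- COMMAND=ADD_BASECOLOR,"Black" --><br><p>second</p></div>'
)

# A list built ahead of time, like element_list in the question.
element_list = list(doc.iter())

# Comments report etree.Comment as their tag, so this test does not
# depend on which Python class the parser used for the comment node.
no_comments = [el for el in element_list if el.tag is not etree.Comment]
tags = [el.tag for el in no_comments]

# Alternatively, ask iter() for real elements only; this skips comments
# (and processing instructions) while walking the tree.
elements_only = [el.tag for el in doc.iter(tag=etree.Element)]

print(tags)
print(elements_only)
```

The tag check works on an existing list, while iter(tag=etree.Element) only applies when you are still walking the tree. I have not confirmed that this explains why isinstance() failed on my original list.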
+1  A: 

You can cut out the strings:

from lxml.html import HtmlComment # or similar
no_comments=[element for element in element_list if not isinstance(element, HtmlComment)]
Matthew Flaschen
Didn't work; my list still included comments. Hmm, but it might work earlier. The elements in element_list that are comments show up as the comments themselves, if that makes sense: an element that is a comment is <!-- COMMAND=ADD_BASECOLOR,"Black" -->, while an element that is not a comment is <Element br at 12b9928>.
PyNEwbie
But it does work here: elements=[e for e in theTree.cssselect('text')[0].iter() if not isinstance(e,HtmlComment)]
PyNEwbie