I have been having fun manipulating HTML with lxml. Now I want to manipulate the actual file: after finding a particular element that meets my needs, I want to know whether it is possible to retrieve the original source of that element.

I jumped up and down in my chair when I saw that my element has a sourceline attribute, but it did not give me what I wanted.

some_element.sourceline

As near as I can figure, sourceline is only populated when the HTML source is read in as a list of lines, so that there is a line number to report.

I should add that I generated my elements with:

from lxml import html

theTree = html.fromstring(open(myFileRef).read())

the_elements = list(theTree.iter())

To be clear, I am getting None as the value of some_element.sourceline. I tested this for all 27,000 elements in my tree.

One thing I imagine doing is using an element's HTML source in an expression to find that exact place in the document, perhaps to snip something out. I can't rely on the text of an element because the text is not necessarily unique.

One solution that was posted but then taken down was to use sourceline, but even after reading my file in as a list I could not get any value other than None for sourceline. I am going to post another question to see whether someone has an example that uses sourceline.
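For reference, here is a minimal sketch of how sourceline might be used to look up the raw line of the file, assuming the parser actually records a line number (which it did not in my case). SAMPLE and raw_line_for are hypothetical names made up for this illustration, and the helper simply returns None when no line number is available.

```python
from lxml import html

# Hypothetical stand-in for the contents of the real file.
SAMPLE = (
    "<html>\n"
    "<body>\n"
    "<p>first</p>\n"
    "<b>  KEY 1A.&nbsp;&nbsp;&nbsp;&nbsp;REGIONAL PRODUCTION    <br>    </b>\n"
    "</body>\n"
    "</html>\n"
)


def raw_line_for(element, source_lines):
    """Return the raw source line recorded for element, or None if unknown."""
    line_no = element.sourceline  # 1-based line number, or None
    if line_no is None or not 1 <= line_no <= len(source_lines):
        return None
    return source_lines[line_no - 1]


source_lines = SAMPLE.splitlines()
tree = html.fromstring(SAMPLE)

# Print whatever the parser recorded for each <b> element.
for el in tree.iter("b"):
    print(el.sourceline, repr(raw_line_for(el, source_lines)))
```

Even when this works it only yields a whole line, so an element that spans several lines or shares a line with other markup would still need further trimming.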

I just tried and discarded html.tostring(myelement) because it automatically rewrites at least some character entities (I am probably not phrasing that correctly). Here is an example.

A snippet of the HTML source:

<b>  KEY 1A.&nbsp;&nbsp;&nbsp;&nbsp;REGIONAL PRODUCTION    <br>    </b>

html.tostring(the_element,method='html')

'<b>  KEY 1A.&#160;&#160;&#160;&#160;REGIONAL PRODUCTION    <br></b>'

Clearly I am not getting the original, unvarnished source: each &nbsp; has become &#160; and the whitespace after <br> is gone.
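Since serializing through lxml normalizes the markup, one possible workaround is to record character offsets while parsing and slice the untouched text. The sketch below swaps in a different tool, Python's standard-library html.parser.HTMLParser, whose getpos() reports the line and column of the tag being handled; it is not an lxml feature. SpanRecorder, raw_spans and SOURCE are hypothetical names, and the code assumes reasonably well-nested markup.

```python
from html.parser import HTMLParser


class SpanRecorder(HTMLParser):
    """Record the raw source text of every element that has an end tag."""

    def __init__(self, source):
        # convert_charrefs=False leaves entities such as &nbsp; unconverted.
        super().__init__(convert_charrefs=False)
        self.source = source
        # Offset at which each line begins, to turn (line, column) into an offset.
        self.line_starts = [0] + [i + 1 for i, ch in enumerate(source) if ch == "\n"]
        self.open_tags = []  # stack of (tag, offset of its "<")
        self.spans = []      # (tag, raw source text), in closing order

    def _offset(self):
        line, column = self.getpos()  # line is 1-based, column is 0-based
        return self.line_starts[line - 1] + column

    def handle_starttag(self, tag, attrs):
        self.open_tags.append((tag, self._offset()))

    def handle_endtag(self, tag):
        # Walk back to the nearest open tag with the same name; anything
        # above it (for example a void <br>) is dropped as unclosed.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i][0] == tag:
                start = self.open_tags[i][1]
                end = self.source.index(">", self._offset()) + 1
                self.spans.append((tag, self.source[start:end]))
                del self.open_tags[i:]
                break


def raw_spans(source):
    """Parse source and return (tag, raw text) pairs for closed elements."""
    recorder = SpanRecorder(source)
    recorder.feed(source)
    recorder.close()
    return recorder.spans


SOURCE = "<p>x</p>\n<b>  KEY 1A.&nbsp;&nbsp;&nbsp;&nbsp;REGIONAL PRODUCTION    <br>    </b>\n"
for tag, raw in raw_spans(SOURCE):
    print(tag, repr(raw))
```

Because the text is sliced straight from the input string, entities and whitespace come back exactly as they appear in the file, which is what a later search-and-snip step would need.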