ansaurus

Question

Is the content between anchor tags (a) in HTML seen as a branch in lxml?

Answer 1

+1 A:

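No. The content between two sibling anchor tags is not a branch (subtree) of its own: the anchors and everything between them are all children of the same parent. However, you can iterate over that parent and extract the elements between the two anchors (this can be done in various ways, depending on what handles you already have for the anchor tags). Note the handling of text and tail, which avoids losing the text that sits between elements. Modifying example_doc and watching how the output changes may help you better understand this example code.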

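import lxml.etree

example_doc = """
  <root>
    <a name="listofplaces"/>
    text
    <sibling/>
    <sibling/>
    <a name="transport"/>
  </root>
"""
root = lxml.etree.XML(example_doc)

new_root = lxml.etree.Element("root")
it = iter(root)  # one iterator, shared by both loops below

# Skip ahead to the opening anchor; its tail is the text that follows it.
for e in it:
  if e.tag == "a" and e.get("name") == "listofplaces":
    new_root.text = e.tail
    break
else:
  raise LookupError("TODO: handle tag not found")

# Collect everything up to the closing anchor. In lxml, append() moves each
# element (together with its tail) out of root and into new_root.
for e in it:
  if e.tag == "a" and e.get("name") == "transport":
    break
  new_root.append(e)
else:
  raise LookupError("TODO: handle tag not found")

# encoding="unicode" makes tostring() return str rather than bytes.
print(lxml.etree.tostring(new_root, encoding="unicode"))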
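
To see why text and tail matter, here is a minimal sketch of where the text around child elements ends up. The document string and the describe helper are made up for illustration; the fallback import assumes that the standard library's ElementTree, which follows the same text/tail model, is an acceptable stand-in when lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:
    # Fallback: the stdlib ElementTree exposes the same text/tail attributes.
    import xml.etree.ElementTree as etree

# Hypothetical document with text before, between, and after the children.
doc = etree.XML(
    '<root>lead<a name="listofplaces"/>text<sibling/>after'
    '<a name="transport"/>end</root>'
)

def describe(parent):
    """Return (tag, text, tail) for each direct child of parent."""
    return [(child.tag, child.text, child.tail) for child in parent]

root_text = doc.text       # text before the first child belongs to the parent
children = describe(doc)   # text after each child is stored as that child's tail

print(repr(root_text))
for row in children:
    print(row)
```

Text that directly follows an element is stored on that element as its tail, not on the parent, which is why the answer's code copies the tail of the first anchor into new_root.text.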

Roger Pate 2010-03-07 17:40:36

@Roger Thanks, it is a great example.

PyNEwbie 2010-03-07 22:37:51
