I am trying to extract some content from HTML documents. Some of the documents have a table of contents that clearly indicates where the content I want to pull out is located. That is, either the attribute value or the text_content of each anchor tag is easy to identify and points to what I need. For example, the TOC might contain two anchor tags with the following values:

key=href value=#listofplaces text_content=Places we have visited
key=href value=#transport text_content=Ways we have traveled

and then in the body of the document

key=name value=listofplaces text_content=''

Then come many HTML elements: some tables, maybe some div tags, an unknown number of elements in all, followed by the next anchor:

key=name value=transport text_content=''

I was planning on using the output from a function to identify the beginning and end of the section I want to copy from the document. That is, I was going to read the document and snip out the section between the anchor tags listofplaces and transport. Then I started thinking that lxml is so powerful that maybe the content I want is already a branch of the tree, and I just have not been able to figure out how to identify it.

+1  A: 

No, there is no single branch that holds the content between two siblings. However, you can iterate over their parent and extract the elements in between (this can be done in various ways, depending on how you already hold references to the anchor tags). Note the handling of text and tail to avoid losing data. Modifying example_doc and observing the results may help you understand this example code better.

import lxml.etree

example_doc = """
  <root>
    <a name="listofplaces"/>
    text
    <sibling/>
    <sibling/>
    <a name="transport"/>
  </root>
"""
root = lxml.etree.XML(example_doc)

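# Collect the extracted section under a fresh root element.
new_root = lxml.etree.Element("root")
it = iter(root)
# First pass: advance to the starting anchor.
for e in it:
  if e.tag == "a" and e.get("name") == "listofplaces":
    # Text that directly follows the anchor is stored in its tail.
    new_root.text = e.tail
    break
else:
  assert False, "TODO: handle tag not found"
# Second pass: continue from the same iterator up to the ending anchor.
for e in it:
  if e.tag == "a" and e.get("name") == "transport":
    break
  # append() moves e (together with its tail) out of root into new_root.
  new_root.append(e)
else:
  assert False, "TODO: handle tag not found"

# tostring() returns bytes by default; encoding="unicode" yields a str.
print(lxml.etree.tostring(new_root, encoding="unicode"))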
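
If you would rather leave the original tree untouched, the same text/tail idea can be wrapped in a function that copies the siblings instead of moving them. The following is only a minimal sketch: extract_between is a hypothetical helper name, the sample document is made up, and the sketch is written against the standard library's xml.etree.ElementTree so that it runs without extra packages. lxml.etree follows the same ElementTree API, so swapping the import should behave the same way, but treat that as an assumption to verify.

```python
import copy
import xml.etree.ElementTree as ET  # assumption: stdlib stand-in for lxml.etree


def extract_between(parent, start_name, end_name, tag="a"):
    """Return a new <root> with copies of the siblings between two named anchors.

    Hypothetical helper; raises ValueError if either anchor is missing.
    """
    new_root = ET.Element("root")
    children = iter(parent)
    # Skip ahead to the start anchor; text right after it lives in its tail.
    for child in children:
        if child.tag == tag and child.get("name") == start_name:
            new_root.text = child.tail
            break
    else:
        raise ValueError("start anchor %r not found" % start_name)
    # Copy siblings until the end anchor. deepcopy keeps each element's tail
    # and leaves the original tree unmodified.
    for child in children:
        if child.tag == tag and child.get("name") == end_name:
            return new_root
        new_root.append(copy.deepcopy(child))
    raise ValueError("end anchor %r not found" % end_name)


doc = ET.XML(
    '<root><a name="listofplaces"/>text<p>one</p>tail<p>two</p>'
    '<a name="transport"/><p>three</p></root>'
)
section = extract_between(doc, "listofplaces", "transport")
print(ET.tostring(section, encoding="unicode"))
# <root>text<p>one</p>tail<p>two</p></root>
```

Copying costs more memory than moving, but it lets you call the function repeatedly on the same document, for example once per pair of consecutive TOC entries.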
Roger Pate
@Roger Thanks, it is a great example.
PyNEwbie