I am trying to extract some content from HTML documents. Some of the documents have a table of contents that clearly indicates where in the document the content I want to pull out is located. That is, either the attribute value or the text_content of each anchor tag is easily identifiable and points to what I need. For example, I might have two anchor tags in the TOC with the following values:
key=href value=#listofplaces text_content=Places we have visited
key=href value=#transport text_content=Ways we have traveled
and then, in the body of the document, a matching anchor:
key=name value=listofplaces text_content=''
This is followed by many HTML elements (some tables, maybe some div tags, an unknown number of elements in all) and then the next anchor:
key=name value=transport text_content=''
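To make the layout concrete, here is a small sketch of how the TOC links could be read with lxml. The sample markup and the helper name `toc_targets` are made up for illustration; only the anchor names and the link text come from the description above, and the real documents may be structured differently.

```python
from lxml import html

# Hypothetical sample document: the markup is invented, only the
# anchor names and TOC text match the description in the question.
SAMPLE = """
<html><body>
<div id="toc">
<a href="#listofplaces">Places we have visited</a>
<a href="#transport">Ways we have traveled</a>
</div>
<a name="listofplaces"></a>
<table><tr><td>Paris</td></tr></table>
<div>Rome</div>
<a name="transport"></a>
<p>Train</p>
</body></html>
"""


def toc_targets(doc):
    """Return (anchor name, link text) pairs for every in-page link."""
    pairs = []
    for a in doc.xpath('//a[starts-with(@href, "#")]'):
        # Drop the leading '#' so the value matches the name attribute
        # of the corresponding anchor in the body.
        pairs.append((a.get("href")[1:], a.text_content().strip()))
    return pairs


doc = html.fromstring(SAMPLE)
toc = toc_targets(doc)

# Look up the body anchor that each TOC entry points to.
anchors = {name: doc.xpath("//a[@name=$n]", n=name) for name, _ in toc}
```

Each entry in `toc` pairs the name of a body anchor with the human-readable TOC text, so either one can be used to pick the section of interest.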
I was planning to use the output from a function to identify the beginning and end of the section I want to copy from the document. That is, I was going to read the document and snip out the section between the anchor tags listofplaces and transport. Then I started thinking that lxml is so powerful that the content I want might already be a branch of some sort in the parsed tree, one whose identity I just have not been able to figure out.