ansaurus

Question

Answer 1

+2 A:

import re
outer = etree.tostring(string_elem, method='html', encoding='unicode')
inner = re.match(r"^[^>]+>(.*)<[^<]+$", outer, re.DOTALL).group(1)

cobbal 2009-11-29 07:42:09

I know about tostring actually, but that includes the string-tag itself.

miracle2k 2009-11-29 07:52:03

It wouldn't be that hard to trim out manually; a simple regex would work.

cobbal 2009-11-29 08:02:46

+1 for finding one of the rare cases where using a regular expression to process XML isn't a terrible, terrible idea.

Robert Rossney 2009-11-29 17:50:52

Thanks guys. I guess stripping out the surrounding tag would work well enough also.

miracle2k 2009-11-29 22:56:32
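
A regex-free variant of the same idea (serialize, then drop the surrounding tag) is to join the element's leading text with the serialization of each child. This is only a sketch under a few assumptions: it targets Python 3, it uses the standard library's xml.etree.ElementTree as a stand-in for lxml, and inner_markup is a hypothetical helper name that does not appear in the thread. The same pattern should carry over to lxml, but that has not been checked here.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def inner_markup(elem):
    """Return the serialized content of elem without its own start and end tags."""
    # elem.text is the (already unescaped) text before the first child,
    # so it has to be escaped again to remain valid markup.
    parts = [escape(elem.text or "")]
    for child in elem:
        # ET.tostring() includes the child's tail text, which is what we want here.
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)


root = ET.fromstring(
    "<strings><string>Bla <b>One &amp; Two</b> Foo</string></strings>"
)
print(inner_markup(root.find("string")))  # Bla <b>One &amp; Two</b> Foo
```

Because no pattern is matched against the serialized tag, attributes or whitespace in the opening tag do not affect the result.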

Answer 2

A:

Regardless of the language, a relatively simple XSLT template would do the trick.

Something like defining patterns for the tags you want to keep and converting the others to text.

You can of course use a recursive function with a compliant DOM implementation (minidom maybe?) and process tags by hand.

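(a runnable sketch using minidom; attributes on kept tags are dropped)

from xml.dom import minidom

def inner_markup(node, allowed_tags):
    # Text and CDATA nodes contribute their character data as-is.
    if node.nodeType in (node.TEXT_NODE, node.CDATA_SECTION_NODE):
        return node.data
    keep = node.nodeType == node.ELEMENT_NODE and node.tagName in allowed_tags
    text = "<%s>" % node.tagName if keep else ""
    # Recurse: tags outside allowed_tags are dropped, but their text is kept.
    text += "".join(inner_markup(child, allowed_tags) for child in node.childNodes)
    if keep:
        text += "</%s>" % node.tagName
    return text

doc = minidom.parseString("<string>Bla <b>One</b> <u>Two</u></string>")
print(inner_markup(doc.documentElement, {"b"}))  # Bla <b>One</b> Two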

wizzard0 2009-11-29 08:04:38

Answer 3

A:

Not using a parser, just pure string manipulation:

mystring="""
 <strings>
      <string>Bla <b>One &amp; Two</b> Foo</string>
 </strings>
"""
for s in mystring.split("</string>"):
    if "<string>" in s:
        i = s.index("<string>")
        print(s[i+len("<string>"):].replace("&amp;",""))

2009-11-29 08:39:39

The list of things wrong with this approach is not short: among others, it fails if there's an empty `<string/>` element, or if any `<string>` element contains attributes, or whitespace in its opening or closing tag, or if any text node contains character entities or CDATA.

Robert Rossney 2009-11-29 18:38:07

You are assuming too much! And that's a bad habit.

2009-11-30 00:09:57
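
To make the disagreement above concrete, here is a small sketch that wraps the split-based loop from this answer in a function and runs it on a few of the inputs mentioned in the comment. It targets Python 3; naive_extract is a hypothetical name, and the only change from the answer's loop is that results are collected in a list instead of printed.

```python
def naive_extract(xml_text):
    """The split-based approach from the answer, returning a list of results."""
    found = []
    for s in xml_text.split("</string>"):
        if "<string>" in s:
            i = s.index("<string>")
            found.append(s[i + len("<string>"):].replace("&amp;", ""))
    return found


# Plain case: works as intended.
print(naive_extract('<strings><string>ok</string></strings>'))
# ['ok']

# An attribute on the opening tag: the element is silently skipped.
print(naive_extract('<strings><string name="a">lost</string></strings>'))
# []

# A character entity: the ampersand is dropped rather than preserved.
print(naive_extract('<strings><string>One &amp; Two</string></strings>'))
# ['One  Two']

# An empty element plus whitespace in a closing tag: trailing markup leaks in.
print(naive_extract('<strings><string/><string>x</string ></strings>'))
# ['x</string ></strings>']
```

All four inputs are well-formed XML, so a parser-based approach would not be affected by these variations.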

Answer 4

+2 A:

There may be a better way of conditionally handling objects returned by the xpath() function, but I'm not sufficiently conversant with lxml to know what it is, so I had to write a function to return the text value of a node. But that said, this shows a general approach to the problem:

>>> from lxml import etree
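>>> from io import StringIO
>>> def node_text(n):
        try:
            # Elements serialize to markup; with_tail=False avoids duplicating
            # tail text, which node() already returns as its own text node.
            return etree.tostring(n, method='html', with_tail=False, encoding='unicode')
        except TypeError:
            # Text results from xpath() cannot be serialized; use them as-is.
            return str(n)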

>>> f = StringIO('<strings><string>This is <b>not</b> how I plan to escape.</string></strings>')
>>> x = etree.parse(f)
>>> ''.join(node_text(n) for n in x.xpath('/strings/string/node()'))
'This is <b>not</b> how I plan to escape.'

Robert Rossney 2009-11-29 18:28:44

It turns out that instead of using node(), one could also use child.iterdescendants(), but thanks for pointing me in the right direction.

miracle2k 2009-11-29 22:55:28

Python: Extract HTML from an XML file
