views:

165

answers:

1

I'm writing code where I retrieve XML from a web api, then parse that XML using Groovy. Unfortunately, it seems that both XmlParser and XmlSlurper for Groovy strip newline characters from the attributes of nodes when .text() is called.

How can I get at the text of the attribute including the newlines?

Sample code:

def xmltest = '''
<snippet>
   <preSnippet att1="testatt1" code="This is line 1
   This is line 2
   This is line 3" >
      <lines count="10" />
   </preSnippet>
</snippet>'''

def parsed = new XmlParser().parseText( xmltest )
println "Parsed"
parsed.preSnippet.each { pre ->
       println pre.attribute('code');
}


def slurped = new XmlSlurper().parseText( xmltest )
println "Slurped"
slurped.children().each { preSnip ->
   println [email protected]()
   }

the output of which is:

Parsed
This is line 1    This is line 2    This is line 3
Slurped
This is line 1    This is line 2    This is line 3

Ok, I was able to convert the text before I parsed it, then re-convert after, a la:

def newxml = xmltest.replaceAll( /code="[^"]*/ ) {
   return it.replaceAll( /\n/, "~#~" )
}
def parsed = new XmlParser().parseText( xmltest )
def code = pre.attribute('code').replaceAll( "~#~", "\n" )

Not my favorite hack, but it'll do until they fix their XML output.

+1  A: 

New lines are not supported in attributes - this is from the XML specification. They end up 'normalised' which in this case, means they get replaced with a space character. See this section of the spec: http://www.w3.org/TR/REC-xml/#AVNormalize

My team had this problem and our solution was to switch to using elements rather than attributes.

hbunny
That's good to know, and I've notified those who generate the XML that they're doing it wrong... any chance you've got a way to replace just the carriage returns in attributes of an XML file with another char string I can put BACK to carriage returns when I read the text? It's a hack that would fix this while I wait for the real XML change.
Bill James
You could try playing around with character references and, if that doesn't work, custom replaceable sequences that you handle yourself.
hbunny