ansaurus

Question

Answer 1

+5 A:

I'd use XSLT:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" encoding="UTF-8" omit-xml-declaration="yes"/>
    <xsl:strip-space elements="*"/>

    <xsl:template match="*">
        <xsl:copy>
            <xsl:copy-of select="@*" />
            <xsl:apply-templates />
        </xsl:copy>
    </xsl:template>
</xsl:stylesheet>

That should do the trick.

In Python you could use lxml to apply the transformation; there is a sample on its homepage.

For quick tests you can use xsltproc, for example:

xsltproc test.xsl  test.xml

where test.xsl is the file above and test.xml your XML file.
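
To illustrate the lxml route mentioned above, here is a minimal sketch (Python 3) that applies the same stylesheet from within Python. The function name crunch_with_xslt and the inline STYLESHEET constant are made up for this example; only etree.XML and etree.XSLT are lxml calls.

```python
from lxml import etree

# The stylesheet from this answer, inlined as bytes so the sketch is self-contained.
STYLESHEET = b"""<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" encoding="UTF-8" omit-xml-declaration="yes"/>
    <xsl:strip-space elements="*"/>

    <xsl:template match="*">
        <xsl:copy>
            <xsl:copy-of select="@*" />
            <xsl:apply-templates />
        </xsl:copy>
    </xsl:template>
</xsl:stylesheet>"""


def crunch_with_xslt(xml_text):
    """Hypothetical helper: run the stylesheet over xml_text and return the result as text."""
    transform = etree.XSLT(etree.XML(STYLESHEET))  # compile the stylesheet
    result = transform(etree.XML(xml_text))        # apply it to the parsed input
    return str(result)


if __name__ == "__main__":
    sample = """<node1>
    <node2>
        <node3>foo  </node3>
    </node2>
</node1>"""
    print(crunch_with_xslt(sample))
```

Whitespace-only text between elements is dropped by xsl:strip-space, while text that contains other characters (such as "foo  ") is copied through unchanged.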

Johannes Weiß 2009-03-20 18:33:18

I know nothing about XSLT but if that does the job, it looks really cool ;-)

David Zaslavsky 2009-03-20 18:35:20

XSLT is really great for transforming XML, preferably into XML. It is indeed a Turing-complete functional programming language, but ordinary programming in it is (at least in XSLT 1.x) a bit of a pain, since function invocations are very verbose ;-)

Johannes Weiß 2009-03-20 18:41:04

Thanks, I will give it a try. At first glance it seems like it should do the trick.

2009-03-20 18:45:55

Answer 2

+2 A:

Not a solution really, but since you asked for recommendations: I'd advise against writing your own parser (unless you want to learn how to write a complex parser) because, as you say, not all spaces should be removed. There are not only CDATA blocks but also elements with the xml:space="preserve" attribute, which correspond to things like <pre> in XHTML (where the enclosed whitespace actually has meaning). Writing a parser that recognizes those elements and leaves their whitespace alone would be possible but unpleasant.

I would go with an existing parser instead, i.e. load the document and go node by node, printing each one out. That way you can easily identify which nodes you can strip the spaces out of and which you can't. There are some modules in the Python standard library, none of which I have ever used ;-) that could be useful to you... try xml.dom, or possibly xml.parsers.expat, though I'm not sure whether you could do this with it.
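
As a rough illustration of the node-by-node idea, here is a minimal sketch (Python 3) using xml.dom.minidom from the standard library. The helper names strip_blank_text and crunch are invented for this example, and the handling of xml:space is one possible reading of the rule described above, not a complete implementation.

```python
from xml.dom import minidom, Node

XML_NS = "http://www.w3.org/XML/1998/namespace"


def strip_blank_text(node, preserve=False):
    """Hypothetical helper: remove whitespace-only text nodes below node, in place."""
    if node.nodeType == Node.ELEMENT_NODE:
        # xml:space applies to the element and its descendants until overridden.
        space = node.getAttributeNS(XML_NS, "space")
        if space == "preserve":
            preserve = True
        elif space == "default":
            preserve = False
    for child in list(node.childNodes):  # copy the list, since we remove while iterating
        if child.nodeType == Node.TEXT_NODE:
            # CDATA sections have their own node type, so they are never touched here.
            if not preserve and not child.data.strip():
                node.removeChild(child)
        elif child.nodeType == Node.ELEMENT_NODE:
            strip_blank_text(child, preserve)


def crunch(xml_text):
    """Hypothetical helper: parse, drop ignorable whitespace, serialize the root element."""
    doc = minidom.parseString(xml_text)
    strip_blank_text(doc.documentElement)
    return doc.documentElement.toxml()


if __name__ == "__main__":
    print(crunch("<node1>\n    <node2>\n        <node3>foo  </node3>\n    </node2>\n</node1>"))
```

Only text nodes that consist entirely of whitespace are removed, so content such as "foo  " keeps its trailing spaces, and anything under an element marked xml:space="preserve" is left alone.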

David Zaslavsky 2009-03-20 18:34:09

Answer 3

+7 A:

This is pretty easily handled with lxml (note: this particular feature isn't in ElementTree):

from lxml import etree

parser = etree.XMLParser(remove_blank_text=True)

foo = """<node1>
    <node2>
        <node3>foo  </node3>
    </node2>
</node1>"""

bar = etree.XML(foo, parser)
# tostring() returns bytes; decode so the output prints as plain text
print(etree.tostring(bar, pretty_print=False, with_tail=True).decode())

Results in:

<node1><node2><node3>foo  </node3></node2></node1>

Edit: The answer by Triptych reminded me about the CDATA requirements, so the line creating the parser object should actually look like this:

parser = etree.XMLParser(remove_blank_text=True, strip_cdata=False)

Van Gale 2009-03-20 18:52:01

If CDATA is present then this method would HTML-encode everything inside the CDATA block, e.g. converting < into &lt; etc.

2009-03-20 19:08:00

Works now with the changes to the line creating the parser.

2009-03-20 19:17:32

Answer 4

+4 A:

Pretty straightforward with BeautifulSoup.

This solution assumes it is ok to strip whitespace from the tail ends of character data.
Example: <foo> bar </foo> becomes <foo>bar</foo>

It will correctly ignore comments and CDATA.

import BeautifulSoup

s = """
<node1>
    <node2>
        <node3>foo</node3>
    </node2>
    <node3>
      <!-- I'm a comment! Leave me be! -->
    </node3>
    <node4>
    <![CDATA[
      I'm CDATA!  Changing me would be bad!
    ]]>
    </node4>
</node1>
"""

soup = BeautifulSoup.BeautifulStoneSoup(s)

for t in soup.findAll(text=True):
    if type(t) is BeautifulSoup.NavigableString:  # Ignores comments and CDATA
        t.replaceWith(t.strip())

print(soup)

Triptych 2009-03-20 18:59:55

I don't believe this is quite right because it will strip valid whitespace at the end of contents. But, it reminded me that my snippet does the wrong thing with CDATA so thanks for that! :)

Van Gale 2009-03-20 19:06:56

Thanks! This does exactly what I wanted

2009-03-20 19:12:45

But that does CHANGE the document! It's not an equal XML document anymore...

Johannes Weiß 2009-03-20 19:33:21

@Johannes Weiß: exactly

Van Gale 2009-03-20 21:22:52

Crunching xml with python