I'd like to remove certain tags from an XML document as part of a filtering process but I cannot otherwise modify the appearance or structure of the XML.

The input XML arrives as a string, e.g.:

<?xml version="1.0" encoding="UTF-8"?>
<main>
    <mytag myattr="123"/>
    <mytag myattr="456"/>
</main>

and the output needs to remove mytag where the attribute value is, say, 456:

<?xml version="1.0" encoding="UTF-8"?>
<main>
    <mytag myattr="123"/>
</main>

A diff should show only the removed tags as differences between the input and output.

I've looked into SAX, StAX and JAXB, but none of these APIs seems able to write the XML back out in exactly the format it came in. Instead they produce well-formed XML with their own indentation and whitespace, which can show up as spurious differences from the input.

My current method uses regular expressions but is not very robust as it doesn't consider all the possible ways of structuring the above XML. For example, to match the attribute value:

myattr\s*=\s*"([^"]*)"

This works on the example above, but won't work given this XML tag:

<mytag myattr=
    '123'></mytag>
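To make the fragility concrete, here is a small Java sketch (the class name AttrRegexDemo is made up for illustration, and the pattern uses the lowercase attribute name from the sample XML). It runs the double-quote pattern against three spellings of the same tag. Extra whitespace and line breaks around the equals sign are handled, but a well-formed variant that uses single quotes is silently missed:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AttrRegexDemo {
    // The pattern from the question: only recognises double-quoted values.
    static final Pattern DOUBLE_QUOTED =
            Pattern.compile("myattr\\s*=\\s*\"([^\"]*)\"");

    /** Returns the captured attribute value, or null when the pattern does not match. */
    static String extract(String tag) {
        Matcher m = DOUBLE_QUOTED.matcher(tag);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // prints 123
        System.out.println(extract("<mytag myattr=\"123\"/>"));
        // prints 123 (\s also matches the line break and indentation)
        System.out.println(extract("<mytag myattr=\n    \"123\"></mytag>"));
        // prints null (single quotes are legal XML but not covered by the pattern)
        System.out.println(extract("<mytag myattr=\n    '123'></mytag>"));
    }
}
```

Every such variation (single quotes, character references inside the value, the attribute name appearing inside a comment or CDATA section) needs its own special case in the pattern.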

Are regular expressions really the best option in this situation?

+5  A: 

Don't use regular expressions to parse XML! You already know what happens when you try, and I have a spiel on why this is.

In your case you should use XSLT. An XSLT file to do what you want is very simple and easy to follow. It's basically the following:

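<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Drop every mytag whose myattr is 456 (empty template = no output) -->
  <xsl:template match="mytag[@myattr='456']"/>
  <!-- Identity template: copy everything else, including text and comments -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>

This copies every node unchanged, text and whitespace included, unless it is a mytag element whose myattr attribute is 456.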

I tested it on your example file and got the output you said you wanted.

As for using XSLT from Java, it looks like an entire book has been written on the subject. You can probably use whichever XML library is your favourite. I've never actually used XSLT with Java, so I can't tell you which library is easiest to use.
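For what it's worth, the JDK ships an XSLT 1.0 processor behind the standard javax.xml.transform API, so no extra library is strictly needed. The following is a minimal sketch rather than a tested production setup. The class name XsltFilter is made up, and the embedded stylesheet assumes the goal from the question (dropping mytag elements with myattr 456) together with an identity template that also copies text nodes:

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class XsltFilter {
    // Assumed stylesheet: an empty template swallows the unwanted element,
    // the identity template copies every other node unchanged.
    static final String STYLESHEET =
          "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:template match=\"mytag[@myattr='456']\"/>"
        + "<xsl:template match='@*|node()'>"
        + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    /** Applies the stylesheet to the given XML string and returns the result as a string. */
    static String filter(String xml) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws TransformerException {
        String input = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
                + "<main>\n"
                + "    <mytag myattr=\"123\"/>\n"
                + "    <mytag myattr=\"456\"/>\n"
                + "</main>";
        System.out.println(filter(input));
    }
}
```

Note that the serializer decides how the result is written out (quote style, how empty elements are closed, line breaks after the XML declaration), so a textual diff against the input should be checked before relying on this.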

Welbog
I'll have a look at XSLT, but will it preserve the structure of the input XML? The examples I gave were well formatted, but imagine the sample XML was all on one line. Would the output also be all on one line?
Alex Spurling
@Alex Spurling: This *should* be completely irrelevant: XML is about data, not about serialization format. Why is it important to you?
Tomalak
Because he's using non-XML-aware diff tools.
Robert Rossney
I totally agree, and I would normally hate to resort to regular expressions in this situation. However, requirements have a habit of conflicting with best practices. In this case we need to receive XML from one client, remove sensitive information, and then forward it to another client who doesn't expect us to change the format of the XML. I've actually done a bit more experimenting with StAX, and it appears it is possible to output an XML stream in the same format as it was input. I will create a new answer to this question if it works. It should be a lot nicer than my regex solution!
Alex Spurling
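One way the StAX idea could look, purely as a sketch and not necessarily what Alex ended up with: read the input as a stream of events and pass every event straight to a writer, skipping the events that belong to the unwanted element. The class name StaxFilter and the hard-coded mytag/myattr/456 values are assumptions taken from the question's example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

final class StaxFilter {
    /** Copies all events to the output except mytag elements whose myattr is 456. */
    static String filter(String xml) throws XMLStreamException {
        XMLEventReader reader =
                XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml));
        StringWriter out = new StringWriter();
        XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(out);

        int skipDepth = 0; // > 0 while we are inside an element being dropped
        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (skipDepth > 0) {
                // Track nesting so the matching end tag is skipped as well.
                if (event.isStartElement()) skipDepth++;
                else if (event.isEndElement()) skipDepth--;
                continue;
            }
            if (event.isStartElement() && isUnwanted(event.asStartElement())) {
                skipDepth = 1;
                continue;
            }
            writer.add(event); // whitespace arrives as character events and is copied too
        }
        writer.flush();
        writer.close();
        return out.toString();
    }

    private static boolean isUnwanted(StartElement element) {
        Attribute attr = element.getAttributeByName(new QName("myattr"));
        return "mytag".equals(element.getName().getLocalPart())
                && attr != null
                && "456".equals(attr.getValue());
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(filter(
                "<main>\n    <mytag myattr=\"123\"/>\n    <mytag myattr=\"456\"/>\n</main>"));
    }
}
```

Text between elements, including indentation, passes through as character events, which is what keeps the layout close to the input. The writer still chooses its own serialization for some details, though: a self-closing tag may come back as a start/end pair and quote style may be normalized, so the result is worth diffing against real input before depending on it.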