I'd like to remove certain tags from an XML document as part of a filtering process, but I cannot otherwise modify the appearance or structure of the XML.
The input XML arrives as a string, e.g.:
<?xml version="1.0" encoding="UTF-8"?>
<main>
<mytag myattr="123"/>
<mytag myattr="456"/>
</main>
The output needs to have mytag removed where the attribute value is, say, 456:
<?xml version="1.0" encoding="UTF-8"?>
<main>
<mytag myattr="123"/>
</main>
A diff should show only the removed tags as differences between the input and output.
I've looked into SAX, StAX and JAXB, but none of these APIs appears able to write the XML back out in exactly the format in which it was read. Instead they produce well-structured XML with their own indentation and whitespace, so a diff will sometimes show differences from the input that have nothing to do with the removed tags.
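To illustrate the concern, here is a minimal sketch, assuming the StAX implementation bundled with the JDK. It copies every event from a reader straight to a writer without filtering anything. The class name `StaxRoundTripDemo`, the method `roundTrip` and the sample input (a quoted attribute value with a line break inside the tag) are hypothetical and only for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;

class StaxRoundTripDemo {

    /** Parses the XML and re-serializes every event unchanged. */
    static String roundTrip(String xml) {
        try {
            XMLEventReader reader = XMLInputFactory.newInstance()
                    .createXMLEventReader(new StringReader(xml));
            StringWriter out = new StringWriter();
            XMLEventWriter writer = XMLOutputFactory.newInstance()
                    .createXMLEventWriter(out);
            writer.add(reader);   // copy all events from reader to writer
            writer.close();
            return out.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical input: the line break sits inside the start tag.
        String input = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
                + "<main>\n<mytag myattr=\n\"123\"></mytag>\n</main>";
        String output = roundTrip(input);
        System.out.println(output);
        System.out.println("identical to input: " + output.equals(input));
    }
}
```

The events carry the element name and attribute value but not the whitespace inside the start tag, so I would not expect the output to match the input character for character even though nothing was removed.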
My current method uses regular expressions, but it is not very robust because it doesn't account for all the possible ways of writing the above XML. For example, to match the attribute value:
myattr\s*=\s*"([^"]*)"
This works on the example above, but won't work given this XML tag:
<mytag myattr=
123></mytag>
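The behaviour can be reproduced with a minimal sketch, assuming the pattern is applied through `java.util.regex`. The class name `AttrRegexDemo` and the helper `attrValue` are hypothetical; only the pattern and the two sample tags come from the text above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AttrRegexDemo {

    // The pattern from above; it only recognises double-quoted values.
    static final Pattern MYATTR =
            Pattern.compile("myattr\\s*=\\s*\"([^\"]*)\"");

    /** Returns the captured attribute value, or null if the pattern does not match. */
    static String attrValue(String tag) {
        Matcher m = MYATTR.matcher(tag);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Tag from the first example: the value is captured.
        System.out.println(attrValue("<mytag myattr=\"456\"/>"));        // prints 456
        // Tag from the second example: no double quotes, so no match.
        System.out.println(attrValue("<mytag myattr=\n123></mytag>"));  // prints null
    }
}
```

Note that `\s*` already tolerates the line break after the equals sign; in this sketch it is the missing double quotes that prevent the match.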
Are regular expressions really the best option in this situation?