I have had good experiences with STX (Streaming Transformations for XML). Basically, it is a streaming variant of XSLT, well suited to processing huge amounts of data with a minimal memory footprint. It has a Java implementation named Joost.
It should be easy to come up with an STX transform that ignores all elements until one matches a given XPath, copies that element and all its children (using an identity template within a template group), and then goes back to ignoring elements until the next match.
UPDATE
I hacked together an STX transform that does what I understand you want. It relies mostly on STX-only features such as template groups and configurable default templates.
<stx:transform xmlns:stx="http://stx.sourceforge.net/2002/ns"
    version="1.0" pass-through="none" output-method="xml">
  <stx:template match="element/child">
    <stx:process-self group="copy" />
  </stx:template>
  <stx:group name="copy" pass-through="all">
  </stx:group>
</stx:transform>
The pass-through="none" attribute on stx:transform configures the default templates (for nodes, attributes, etc.) to produce no output but still process child elements. The stx:template then matches the XPath element/child (this is where you put your own match expression) and "processes self" in the "copy" group, meaning that the matching template from the group named "copy" is invoked on the current element. That group has pass-through="all", so its default templates copy their input and process child elements. When the element/child element ends, control passes back to the template that invoked process-self, and the following elements are ignored again until the template matches the next time.
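For readers without an STX processor at hand, the same skip-then-copy control flow can be sketched in plain Java with the JDK's StAX event API. This is not Joost and not how STX works internally, only a rough illustration of the behaviour described above. The class name SubtreeFilter and the method extract are made up for this example, and it assumes the match is a simple parent/child pair of local names (namespaces are ignored when matching).

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper, not part of Joost or STX.
public class SubtreeFilter {

    // Copies every <child> element whose direct parent is <parent>
    // (compared by local name only) and drops everything else.
    public static String extract(String xml, String parent, String child)
            throws XMLStreamException {
        XMLEventReader reader =
                XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml));
        StringWriter out = new StringWriter();
        XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(out);

        Deque<String> path = new ArrayDeque<>(); // local names of the open elements
        int copyDepth = 0;                       // > 0 while inside a matched subtree

        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement()) {
                String name = event.asStartElement().getName().getLocalPart();
                // Roughly the role of match="element/child": parent is on top of the stack.
                boolean matches = copyDepth == 0
                        && name.equals(child)
                        && parent.equals(path.peek());
                if (matches || copyDepth > 0) {
                    copyDepth++;                 // roughly pass-through="all"
                }
                path.push(name);
                if (copyDepth > 0) {
                    writer.add(event);
                }
            } else if (event.isEndElement()) {
                path.pop();
                if (copyDepth > 0) {
                    writer.add(event);
                    copyDepth--;                 // back to ignoring once it reaches 0
                }
            } else if (copyDepth > 0) {
                writer.add(event);               // text, comments, etc. inside a match
            }
            // Everything else is skipped, roughly pass-through="none".
        }
        writer.close();
        return out.toString();
    }
}
```

Calling SubtreeFilter.extract(xml, "element", "child") on a document shaped like the example input should return only the copied child subtrees. Like the STX version, it holds only the stack of open element names in memory rather than the whole document tree, although this sketch buffers the result in a string for simplicity.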
The following is an example input file:
<root>
<child attribute="no-parent, so no copy">
</child>
<element id="id1">
<child attribute="value1">
text1<b>bold</b>
</child>
</element>
<element id="id2">
<child attribute="value2">
text2
<x:childX xmlns:x="http://x.example.com/x">
<!-- comment -->
yet more<b i="i" x:i="x-i" ></b>
</x:childX>
</child>
</element>
</root>
This is the corresponding output file:
<?xml version="1.0" encoding="UTF-8"?>
<child attribute="value1">
text1<b>bold</b>
</child><child attribute="value2">
text2
<x:childX xmlns:x="http://x.example.com/x">
<!-- comment -->
yet more<b i="i" x:i="x-i" />
</x:childX>
</child>
The unusual formatting is a result of skipping the text nodes containing newlines outside the child elements.