I have a file that consists of concatenated valid XML documents. I'd like to separate individual XML documents efficiently.

The contents of the concatenated file look like this, so the concatenated file is not itself a valid XML document:

<?xml version="1.0" encoding="UTF-8"?>
<someData>...</someData>
<?xml version="1.0" encoding="UTF-8"?>
<someData>...</someData>
<?xml version="1.0" encoding="UTF-8"?>
<someData>...</someData>

Each individual XML document is around 1-4 KB, but there are potentially a few hundred of them. All the XML documents conform to the same XML Schema.

Any suggestions or tools? I am working in the Java environment.

Edit: I am not sure if the xml-declaration will be present in documents or not.

Edit: Let's assume that the encoding for all the xml docs is UTF-8.

+1  A: 

Since you're not sure the declaration will always be present, you can strip all declarations (a regex such as <\?xml\s[^?]*\?> can find these; a greedy .* is risky here because it can run on to the last ?> on the same line), prepend <doc-collection>, and append </doc-collection>, so that the resulting string is a valid XML document. In it, you can retrieve the separate documents using (for instance) the XPath query /doc-collection/*. If the combined file can be large enough that memory consumption becomes an issue, you may need to use a streaming parser such as SAX, but the principle remains the same.

In a similar scenario which I encountered, I simply read the concatenated document directly using an XML parser: although the concatenated file may not be a valid XML document, it is a valid XML fragment (barring the repeated declarations). So, once you strip the declarations, if your parser supports parsing fragments, you can also just read the result directly. All top-level elements will then be the root elements of the concatenated documents.

In short, if you strip all declarations, you'll have a valid xml fragment which is trivially parseable either directly or by surrounding it with some tag.
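
A minimal Java sketch of the strip-and-wrap approach, under some assumptions: the whole file fits in memory as a String, no document carries a DOCTYPE, and the text "<?xml " never appears inside a comment or CDATA section. The class name WrapAndSplit and the method split are hypothetical names chosen for illustration; the wrapper element doc-collection and the XPath query are the ones mentioned above.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: strips XML declarations, wraps the remainder in a
// synthetic root element, and returns each original root element as a string.
final class WrapAndSplit {

    static List<String> split(String concatenated) {
        try {
            // Remove every "<?xml ...?>" declaration. [^?]* cannot run past the
            // first "?>", and the \s keeps "<?xml-stylesheet ...?>" untouched.
            String body = concatenated.replaceAll("<\\?xml\\s[^?]*\\?>", "");
            String wrapped = "<doc-collection>" + body + "</doc-collection>";

            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(wrapped)));

            // Every element child of the synthetic root is one original document.
            NodeList roots = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate("/doc-collection/*", doc, XPathConstants.NODESET);

            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            List<String> documents = new ArrayList<>();
            for (int i = 0; i < roots.getLength(); i++) {
                StringWriter out = new StringWriter();
                serializer.transform(new DOMSource(roots.item(i)), new StreamResult(out));
                documents.add(out.toString());
            }
            return documents;
        } catch (Exception e) {
            // Covers parser configuration, parsing, XPath and serialization failures.
            throw new IllegalArgumentException("Input is not a sequence of well-formed documents", e);
        }
    }
}
```

Because each document is re-serialized from the DOM, the output may differ from the input in insignificant ways (attribute quoting, for example). If byte-for-byte copies are needed, a text-level split is probably a better fit.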

Eamon Nerbonne
+2  A: 

Don't split! Add one big tag around it! Then it becomes one XML file again:

<BIGTAG>
<?xml version="1.0" encoding="UTF-8"?>
<someData>...</someData>
<?xml version="1.0" encoding="UTF-8"?>
<someData>...</someData>
<?xml version="1.0" encoding="UTF-8"?>
<someData>...</someData>
</BIGTAG>

Now, using /BIGTAG/someData would give you all the XML roots. (XPath is case-sensitive, so the element name must match the documents exactly.)


If processing instructions are in the way, you can always use a RegEx to remove them. It's easier to just remove all processing instructions than to use a RegEx to find all root nodes. If the declared encoding differs between documents, remember this: the file as a whole was written in a single encoding, so all the XML documents it contains use that same encoding, no matter what each header tells you. If the big file is encoded as UTF-16, then it doesn't matter that the XML processing instructions say the XML is UTF-8. It won't be UTF-8, since the whole file is UTF-16. The encoding in those XML processing instructions is therefore invalid.

By merging them into one file, you've altered the encoding...


By RegEx, I mean regular expressions. You just have to remove all text that's between a <? and a ?> which should not be too difficult with a regular expression and slightly more complicated if you're trying other string manipulation techniques.

Workshop Alex
Processing Instructions starting with "xml" or "XML" are reserved for XML standards, so using them as "custom" PIs like this isn't really valid.
Joachim Sauer
At least Firefox's XML parser didn't like this...
Juha Syrjälä
I think this is largely right other than the processing instructions
Brian Agnew
This will not work if the XML docs do not all use the same encoding.
Juha Syrjälä
You will need to strip out those <?xml ... ?> things. Might be possible in the "dump xml"-stage.
Thorbjørn Ravn Andersen
This is why I suggested splitting instead - it's simpler, possibly faster, and not hard to get right.
Eamon Nerbonne
@Eamon, splitting is more difficult if those processing instructions aren't always included. Furthermore, those instructions don't make sense since they'll all use the same encoding as the big document. Java is quite good at regular expressions so with a simple expression you could delete all those instructions and the rest would become pure XML if you contain it in a supertag.
Workshop Alex
@JuHa S., the encoding is already invalid since everything is located in a single text file, thus it all uses the same encoding.
Workshop Alex
+2  A: 

As Eamon says, if you know the <?xml> thing will always be there, just break on that.

Failing that, look for the ending document-level tag. That is, scan the text counting how many levels deep you are. Skip anything that begins with "<?" or "<!" (declarations, processing instructions, comments, CDATA sections). Every time you see a tag that begins with "<" but not "</" and that does not end with "/>", add 1 to the depth count. Every time you see a tag that begins with "</", subtract 1. Every time you subtract 1, check if you are now at zero. If so, you've reached the end of an XML document. (A self-closing tag seen at depth zero is a complete document on its own.)
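
The depth-counting scan could be sketched in Java roughly as follows. This is an illustration under stated assumptions, not a full XML tokenizer: the input is already decoded into a String, documents contain no DOCTYPE (the sketch rejects one), and each returned document starts at its root start tag, so any declaration or comment outside the root is dropped. DepthSplitter and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits concatenated XML by tracking element depth.
final class DepthSplitter {

    static List<String> split(String text) {
        List<String> documents = new ArrayList<>();
        int depth = 0;   // current element nesting level
        int start = -1;  // offset of the current document's root start tag
        int i = 0;
        while (true) {
            int lt = text.indexOf('<', i);
            if (lt < 0) break;
            if (text.startsWith("<!--", lt)) {
                i = endOf(text, "-->", lt + 4);          // comment: skip
            } else if (text.startsWith("<![CDATA[", lt)) {
                i = endOf(text, "]]>", lt + 9);          // CDATA: skip
            } else if (text.startsWith("<?", lt)) {
                i = endOf(text, "?>", lt + 2);           // declaration or PI: skip
            } else if (text.startsWith("<!", lt)) {
                throw new IllegalArgumentException("DOCTYPE is not handled by this sketch");
            } else if (text.startsWith("</", lt)) {
                i = endOf(text, ">", lt + 2);            // end tag: one level up
                if (--depth < 0) throw new IllegalArgumentException("Unbalanced end tag at " + lt);
                if (depth == 0) documents.add(text.substring(start, i));
            } else {
                i = endOfTag(text, lt + 1);              // start or self-closing tag
                if (depth == 0) start = lt;
                boolean selfClosing = text.charAt(i - 2) == '/';
                if (!selfClosing) depth++;
                else if (depth == 0) documents.add(text.substring(start, i));
            }
        }
        if (depth != 0) throw new IllegalArgumentException("Input ends inside a document");
        return documents;
    }

    // Returns the offset just past the next occurrence of terminator.
    private static int endOf(String text, String terminator, int from) {
        int at = text.indexOf(terminator, from);
        if (at < 0) throw new IllegalArgumentException("Missing " + terminator);
        return at + terminator.length();
    }

    // Returns the offset just past the '>' that closes a tag, ignoring any
    // '>' that appears inside a quoted attribute value.
    private static int endOfTag(String text, int from) {
        char quote = 0;
        for (int i = from; i < text.length(); i++) {
            char c = text.charAt(i);
            if (quote != 0) {
                if (c == quote) quote = 0;
            } else if (c == '"' || c == '\'') {
                quote = c;
            } else if (c == '>') {
                return i + 1;
            }
        }
        throw new IllegalArgumentException("Unterminated tag");
    }
}
```

Unlike the wrapping approach, this returns the original text of each document unchanged, and it does not depend on the declaration being present.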

Jay
Why not just look for </someData>?
wds
And again, why not remove the processing instructions instead, adding everything else in a bigger tag? The processing instruction isn't useful any more since all files use the same encoding as the big document. With them gone, including a super-tag just turns it into valid XML again.
Workshop Alex
It depends on what the ultimate requirement is. The question was stated as "How do I split them?", so that's what I was trying to answer. Without knowing what the original poster is trying to do with the output, I don't know whether wrapping them all in one big tag is a viable solution or not. If it is, great, go for it. There might be other potential solutions in that direction. For example, if the files all share a common top-level tag, maybe you could combine them all under a single such tag, i.e. strip out the start tags on all but the first and the end tags on all but the last.
Jay
I ended up breaking at starting root elements.
Juha Syrjälä
A: 

I don't have a Java answer, but here's how I solved this problem with C#.

I created a class named XmlFileStreams to scan the source document for the XML document declaration and break it up logically into multiple documents:

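using System.Collections.Generic;
using System.IO;

class XmlFileStreams {

    // Byte offsets at which each "<?xml" marker starts, followed by the file length.
    List<int> positions = new List<int>();
    byte[] bytes;

    public XmlFileStreams(string filename) {
        bytes = File.ReadAllBytes(filename);
        // Brute-force scan; "<=" so that a marker in the last five bytes is not skipped.
        // Note: this also matches other targets starting with "xml", e.g. "<?xml-stylesheet".
        for (int pos = 0; pos <= bytes.Length - 5; ++pos)
            if (bytes[pos] == '<' && bytes[pos + 1] == '?' && bytes[pos + 2] == 'x' && bytes[pos + 3] == 'm' && bytes[pos + 4] == 'l')
                positions.Add(pos);
        positions.Add(bytes.Length);
    }

    // One in-memory stream per document; anything before the first "<?xml" is ignored.
    public IEnumerable<Stream> Streams {
        get {
            if (positions.Count > 1)
                for (int i = 0; i < positions.Count - 1; ++i)
                    yield return new MemoryStream(bytes, positions[i], positions[i + 1] - positions[i]);
        }
    }

}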

To use XmlFileStreams:

foreach (Stream stream in new XmlFileStreams(@"c:\tmp\test.xml").Streams) {
    using (var xr = XmlReader.Create(stream, new XmlReaderSettings() { XmlResolver = null, ProhibitDtd = false })) {
        // parse file using xr
    }
}

There are a couple of caveats.

  1. It reads the entire file into memory for processing. This could be a problem if the file is really big.
  2. It uses a simple brute force search to look for the XML document boundaries.
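
For readers on the Java side, a rough equivalent of the same idea might look like the sketch below. It shares both caveats and adds a few assumptions of its own: the file is UTF-8 (as stated in the question), every document begins with an XML declaration written as "<?xml" followed by a space, and that text never occurs inside a comment or CDATA section. DeclarationSplitter, split and splitFile are hypothetical names.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: cuts the text at every XML declaration.
final class DeclarationSplitter {

    // The trailing space avoids matching "<?xml-stylesheet ...?>".
    private static final String MARKER = "<?xml ";

    static List<String> split(String text) {
        List<String> documents = new ArrayList<>();
        int start = text.indexOf(MARKER);
        while (start >= 0) {
            int next = text.indexOf(MARKER, start + MARKER.length());
            int end = next >= 0 ? next : text.length();
            // trim() drops the line break that separates two documents.
            documents.add(text.substring(start, end).trim());
            start = next;
        }
        return documents;
    }

    // Reads the whole file into memory, decoding it as UTF-8.
    static List<String> splitFile(Path file) throws IOException {
        return split(new String(Files.readAllBytes(file), StandardCharsets.UTF_8));
    }
}
```

Each returned string keeps its own declaration, so it can be handed to any XML parser as a complete document.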
Ferruccio