What is the best way of splitting a large XML document into smaller sections that are still valid XML? For my purposes, I need to split them into roughly thirds or fourths, but for the sake of providing examples, splitting them into n components would be good.

I would like to use a language that I am familiar with (Java, C#, Ruby, PHP, or C/C++), although examples in any language or pseudocode are more than welcome.

+3  A: 

Well, of course you can always extract the top-level elements (whether this is the granularity you want is up to you). In C#, you'd use the XmlDocument class. For example, if your XML file looked something like this:

<Document>
  <Piece>
     Some text
  </Piece>
  <Piece>
     Some other text
  </Piece>
</Document>

then you'd use code like this to extract all of the Pieces:

XmlDocument doc = new XmlDocument();
doc.Load("<path to xml file>");
XmlNodeList nl = doc.GetElementsByTagName("Piece");
foreach (XmlNode n in nl)
{
    // Do something with each Piece node
}

Once you've got the nodes, you can do something with them in your code, or you can transfer the entire text of the node to its own XML document and act on that as if it were an independent piece of XML (including saving it back to disk, etc.).
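A hedged sketch of that last step (ImportNode is what lets a node cross into another document; the output file names here are made up):

int i = 0;
foreach (XmlNode n in nl)
{
    // Wrap each Piece in its own document under a fresh root element.
    XmlDocument pieceDoc = new XmlDocument();
    XmlNode root = pieceDoc.AppendChild(pieceDoc.CreateElement("Document"));
    root.AppendChild(pieceDoc.ImportNode(n, true));
    pieceDoc.Save("piece" + i++ + ".xml"); // hypothetical output name
}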

DannySmurf
+1  A: 

This is more of a comment than an answer, but wouldn't:

XmlDocument doc = new XmlDocument();
doc.Load("path");

read the entire file at once? Just thought I should raise the point, since from the look of Thomas's question, he is concerned about reading large files and wants to break the process down.

Rob Cooper
A: 

@robcthegeek: The files are too large for the processing that needs to be done with them on the server side, but I think I should be able to read them all at once and then break them up.

Thomas Owens
+3  A: 

As DannySmurf touches on here, it is all about the structure of the XML document.
If you have only two huge "top level" tags, it will be extremely hard to split the document in a way that makes it possible both to merge it back together and to read it piece by piece as valid XML.

Given a document with a lot of separate pieces like the ones in DannySmurf's example, it should be fairly easy.
Some rough code in pseudo-C#:

int nrOfPieces = 5;
XmlDocument xmlOriginal = ...; // some input parameter

// Construct the list of target documents, each starting with
// an empty copy of the original root element.
var xmlList = new List<XmlDocument>();
for (int i = 0; i < nrOfPieces; i++)
{
    var xmlDoc = new XmlDocument();
    xmlDoc.AppendChild(xmlDoc.CreateElement(xmlOriginal.DocumentElement.Name));
    xmlList.Add(xmlDoc);
}

// Distribute the nodes round-robin across the target documents.
// ImportNode is needed because a node cannot be appended to a
// document other than its owner.
var nodeList = xmlOriginal.GetElementsByTagName("Piece");
for (int i = 0; i < nodeList.Count; i++)
{
    var xmlDoc = xmlList[i % nrOfPieces];
    var nodeToCopy = xmlDoc.ImportNode(nodeList[i], true);
    xmlDoc.DocumentElement.AppendChild(nodeToCopy);
}

This should give you n documents containing valid XML, plus the possibility of merging them back together.
But again, it depends on the XML file.
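For completeness, a rough sketch of the merge direction under the same assumptions (all pieces share the root element name, and ImportNode again carries nodes across document boundaries):

var merged = new XmlDocument();
merged.AppendChild(merged.CreateElement(xmlList[0].DocumentElement.Name));
foreach (XmlDocument piece in xmlList)
{
    foreach (XmlNode child in piece.DocumentElement.ChildNodes)
    {
        merged.DocumentElement.AppendChild(merged.ImportNode(child, true));
    }
}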

Lars Mæhlum
+1  A: 

It would read the entire file at once. In my experience, though, if you're just reading the file, doing some processing (i.e., breaking it up) and then continuing on with your work, the XmlDocument is going to go through its create/read/collect cycle so quickly that it likely won't matter.

Of course, that depends on what a "large" file is. If it's a 30 MB XML file (which I would consider large for an XML file), it probably won't make any difference. If it's a 500 MB XML file, using XmlDocument will become extremely problematic on systems without a significant amount of RAM (in that case, however, I'd argue that the time to manually pick through the file with an XmlReader would be the more significant impediment).

DannySmurf
A: 

It looks like you're working with C# and .NET 3.5. I have come across some posts that suggest using a yield-based algorithm on a file stream with an XmlReader.
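Those posts aren't reproduced here, but a minimal sketch of the yield pattern they describe might look like this (the StreamPieces name and the Piece element are illustrative assumptions, not from those posts):

using System.Collections.Generic;
using System.Xml;

static class XmlPieces
{
    // Lazily yields each matching element as a standalone XML string,
    // so the whole document is never held in memory at once.
    public static IEnumerable<string> StreamPieces(string path, string elementName)
    {
        using (XmlReader reader = XmlReader.Create(path))
        {
            while (reader.ReadToFollowing(elementName))
            {
                // ReadOuterXml returns the complete element and advances past it.
                yield return reader.ReadOuterXml();
            }
        }
    }
}

Each yielded string is itself well-formed XML, so it can be wrapped in a root element and written out as one of the n parts.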

Here are a couple of blog posts to get you started down the path:

Code Monkey
A: 

Not sure what type of processing you're doing, but for very large XML, I've always been a fan of event-based processing. Maybe it's my Java background, but I really do like SAX. You need to do your own state management, but once you get past that, it's a very efficient method of parsing XML.

http://saxdotnet.sourceforge.net/

youphoric

A: 

I'm going to go with youphoric on this one. For very large files, SAX (or any other streaming parser) will be a great help in processing. Using DOM you can collect just the top-level nodes, but you still have to parse the entire document to do it. Using a streaming parser with event-based processing lets you "skip" the nodes you aren't interested in, which makes the processing faster.

A: 

If you are not completely allergic to Perl, then XML::Twig comes with a tool named xml_split that can split a document, producing well-formed XML sections. You can split on a level of the tree, by size, or on an XPath expression.

mirod
+5  A: 

Parsing XML documents using DOM doesn't scale.

This Groovy script uses StAX (Streaming API for XML) to split an XML document between the top-level elements (those that share the same QName as the first child of the root element). It's pretty fast, handles arbitrarily large documents, and is very useful when you want to split a large batch file into smaller pieces.

Requires Groovy on Java 6, or a StAX API and an implementation such as Woodstox on the CLASSPATH.

import javax.xml.stream.*

pieces = 5
input = "input.xml"
output = "output_%04d.xml"
eventFactory = XMLEventFactory.newInstance()
fileNumber = elementCount = 0

def createEventReader() {
    reader = XMLInputFactory.newInstance().createXMLEventReader(new FileInputStream(input))
    start = reader.next()
    root = reader.nextTag()
    firstChild = reader.nextTag()
    return reader
}

def createNextEventWriter () {
    println "Writing to '${filename = String.format(output, ++fileNumber)}'"
    writer = XMLOutputFactory.newInstance().createXMLEventWriter(new FileOutputStream(filename), start.characterEncodingScheme)
    writer.add(start)
    writer.add(root)
    return writer
}

elements = createEventReader().findAll { it.startElement && it.name == firstChild.name }.size()
println "Splitting ${elements} <${firstChild.name.localPart}> elements into ${pieces} pieces"
chunkSize = elements / pieces
writer = createNextEventWriter()
writer.add(firstChild)
createEventReader().each { 
    if (it.startElement && it.name == firstChild.name) {
        if (++elementCount > chunkSize) {
            writer.add(eventFactory.createEndDocument())
            writer.flush()
            writer = createNextEventWriter()
            elementCount = 0
        }
    }
    writer.add(it)
}

Jonas Engman
A: 

I did a YouTube video showing how to split XML files with foxe (the free XML editor from Firstobject) using only a small amount of memory regardless of the size of the input and output files.

The memory usage for this CMarkup XML reader (pull parser) and XML writer solution depends on the size of the subdocuments that are individually transferred from the input file to the output files, or the minimum block size of 16 KB.

#include <string>
#include "Markup.h"  // CMarkup pull parser / writer

int split()
{
  CMarkup xmlInput, xmlOutput;
  xmlInput.Open( "50MB.xml", MDF_READFILE );
  int nObjectCount = 0, nFileCount = 0;
  while ( xmlInput.FindElem("//ACT") )
  {
    if ( nObjectCount == 0 )
    {
      // Start a new output file with its own root element.
      ++nFileCount;
      std::string strName = "piece" + std::to_string(nFileCount) + ".xml";
      xmlOutput.Open( strName.c_str(), MDF_WRITEFILE );
      xmlOutput.AddElem( "root" );
      xmlOutput.IntoElem();
    }
    // Transfer one complete subdocument from input to output.
    xmlOutput.AddSubDoc( xmlInput.GetSubDoc() );
    ++nObjectCount;
    if ( nObjectCount == 5 ) // 5 ACT elements per output file
    {
      xmlOutput.Close();
      nObjectCount = 0;
    }
  }
  if ( nObjectCount )
    xmlOutput.Close();
  xmlInput.Close();
  return nFileCount;
}

Ben Bryant