views:

394

answers:

4

Hi,

I have a 15 GB XML file that I want to split. It has approximately 300 million lines. It doesn't have any top nodes which are interdependent. Is there a tool available that readily does this for me?

A: 

Not an XML tool, but UltraEdit could probably help. I've used it with 2 GB files and it didn't mind at all. Make sure you turn off the auto-backup feature, though.

MrTelly
I need to split it
sameer karjatkar
+1  A: 

I think you'll have to split manually unless you are interested in doing it programmatically. Here's a sample that does that, though it doesn't mention the max size of handled XML files. When doing it manually, the first problem that arises is how to open the file itself.

I would recommend a very simple text editor - something like Vim. When handling such large files, it is always useful to turn off all forms of syntax highlighting and/or folding.

Other options worth considering:

  1. EditPadPro - I've never tried it with anything this size, but if it's anything like other JGSoft products, it should work like a breeze. Remember to turn off syntax highlighting.

  2. VEdit - I've used this with files of 1 GB in size; it handles them as if they were nothing at all.

  3. EmEditor

Cerebrus
Does the sample in the link provided do tag checking?
sameer karjatkar
If you're asking about the CodeProject link, I think it inserts Root nodes at the beginning and end of each split file.
Cerebrus
Unfortunately, it crashed after 750 MB.
sameer karjatkar
Did you try the text editors (manual splitting)?
Cerebrus
I can vouch for EmEditor's efficiency at editing huge files. Good editor, deserves to be better known; shame the free version was dropped.
bobince
Thanks, @bobince. I haven't had an opportunity to use it myself but have heard about its effectiveness.
Cerebrus
A: 

Here is a low-memory-footprint script to do it in the free firstobject XML editor (foxe) using CMarkup file mode. I am not sure what you mean by no interdependent top nodes, or tag checking. But assuming that under the root element you have millions of top-level elements containing object properties or rows, that each of them needs to be kept together as a unit, and that you want, say, 1 million per output file, you could do this:

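split_xml_15GB()
{
  int nObjectCount = 0, nFileCount = 0;
  CMarkup xmlInput, xmlOutput;
  // Open the input in file read mode so it is not loaded into memory at once
  xmlInput.Open( "15GB.xml", MDF_READFILE );
  xmlInput.FindElem(); // root
  str sRootTag = xmlInput.GetTagName();
  xmlInput.IntoElem();
  // Loop over each top-level element under the root
  while ( xmlInput.FindElem() )
  {
    if ( nObjectCount == 0 )
    {
      // Start a new output piece with the same root tag as the input
      ++nFileCount;
      xmlOutput.Open( "piece" + nFileCount + ".xml", MDF_WRITEFILE );
      xmlOutput.AddElem( sRootTag );
      xmlOutput.IntoElem();
    }
    // Copy the current element with everything inside it, as one unit
    xmlOutput.AddSubDoc( xmlInput.GetSubDoc() );
    ++nObjectCount;
    if ( nObjectCount == 1000000 )
    {
      // This piece is full; close it so the next element starts a new file
      xmlOutput.Close();
      nObjectCount = 0;
    }
  }
  // Close the last, partially filled piece, if there is one
  if ( nObjectCount )
    xmlOutput.Close();
  xmlInput.Close();
  return nFileCount;
}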

I posted a YouTube video and article about this here:

http://www.firstobject.com/xml-splitter-script-video.htm

Ben Bryant
A: 

In what way do you need to split it? It's pretty easy to write code using XmlReader.ReadSubtree. It returns a new XmlReader instance positioned on the current element and covering all of its child elements. So, move to the first child of the root, call ReadSubtree, write all those nodes, call Read() on the original reader, and loop until done.
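The loop described above can be sketched roughly as follows. This is not the .NET XmlReader API: it is a Python analogue that uses the standard library's xml.etree.ElementTree.iterparse to stream the file one top-level element at a time. The function name split_xml, the piece file naming, and the per_file parameter are hypothetical, and the sketch assumes the root element has no namespace and that its attributes do not need to be copied into the pieces.

```python
import xml.etree.ElementTree as ET


def split_xml(src_path, out_prefix, per_file):
    """Stream src_path and write per_file top-level children into each piece.

    Hypothetical helper: returns the list of piece file names it wrote.
    Assumes the root tag has no namespace and root attributes can be dropped.
    """
    pieces = []      # names of the output files written so far
    out = None       # currently open output file, if any
    count = 0        # children written to the current piece
    depth = 0        # nesting depth of the element being parsed
    root = None

    for event, elem in ET.iterparse(src_path, events=("start", "end")):
        if event == "start":
            depth += 1
            if depth == 1:
                root = elem          # remember the root element
                root_tag = elem.tag
            continue

        depth -= 1
        if depth != 1:
            continue                 # not the end of a direct child of the root

        if out is None:
            # Start a new piece wrapped in the same root tag as the input
            name = f"{out_prefix}{len(pieces) + 1}.xml"
            out = open(name, "w", encoding="utf-8")
            out.write(f"<{root_tag}>\n")
            pieces.append(name)

        elem.tail = None             # do not carry over inter-element whitespace
        out.write(ET.tostring(elem, encoding="unicode"))
        out.write("\n")
        count += 1
        root.clear()                 # drop finished children so memory stays flat

        if count == per_file:
            out.write(f"</{root_tag}>\n")
            out.close()
            out = None
            count = 0

    if out is not None:              # close the last, partially filled piece
        out.write(f"</{root_tag}>\n")
        out.close()
    return pieces
```

With the numbers mentioned in this thread, a call might look like split_xml("15GB.xml", "piece", 1000000). Each piece is wrapped in the original root tag, so every output file is well-formed on its own.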

John Saunders