I have large XML files of 100s of MB.

Are there any utilities that can parse XML files and escape special characters in strings without loading the entire file into memory at once?

Thanks

+2  A: 

In Java, don't use the DOM. Use SAX or StAX. If not in Java, you can still use SAX, either with MSXML or with Expat.

bmargulies
Or libxml anywhere else: http://xmlsoft.org/
John Paulett
vtd-xml or extended VTD-XML
vtd-xml-author
A: 

The following C++ program copies a file byte by byte. It uses very little memory, but it is a little slow because it flushes the output after every byte. You can improve the performance by not flushing to the outfile that often.

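// Copy a file one byte at a time using the streams' own buffers
#include <fstream>

int main() {
    std::ifstream infile("original.xml", std::ios::binary);
    std::ofstream outfile("copy.xml", std::ios::binary);
    if (!infile || !outfile) {
        return 1;  // could not open the input or the output file
    }

    char ch;
    // get() reads raw bytes (operator>> would skip whitespace), and using it
    // as the loop condition stops at end of file without repeating the last byte.
    while (infile.get(ch)) {
        outfile.put(ch);
        outfile.flush();  // flushing after every byte is slow; remove for speed
    }

    return 0;  // both streams are closed automatically here
}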
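
The remark about flushing less often can be taken a step further by copying through a fixed-size buffer. The following is only a sketch under assumptions: the helper name `copy_file_in_chunks` and the 64 KiB default are made up for illustration, and memory use stays bounded by the buffer size regardless of how large the file is.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: copies `from` to `to` through a fixed-size buffer,
// so memory use stays near `chunk_size` bytes however large the file is.
// Returns the number of bytes copied (0 if the input could not be opened).
std::size_t copy_file_in_chunks(const std::string& from, const std::string& to,
                                std::size_t chunk_size = 64 * 1024) {
    assert(chunk_size > 0);  // a zero-sized buffer would copy nothing
    std::ifstream in(from, std::ios::binary);
    std::ofstream out(to, std::ios::binary);
    std::vector<char> buffer(chunk_size);
    std::size_t total = 0;

    while (in && out) {
        in.read(buffer.data(), static_cast<std::streamsize>(buffer.size()));
        const std::streamsize got = in.gcount();  // short on the final chunk
        if (got <= 0) {
            break;
        }
        out.write(buffer.data(), got);
        total += static_cast<std::size_t>(got);
    }
    return total;  // no explicit flush: the stream flushes as needed and on close
}
```

Calling `copy_file_in_chunks("original.xml", "copy.xml")` would then replace the loop in `main` above.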

If you want a Unix tool, I guess you could use sed.
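
To get from copying to escaping, the byte-by-byte loop could write an entity instead of the raw character whenever it meets one of XML's five reserved characters. This is a rough sketch, not a parser: the names `put_escaped`, `escape_stream` and `escape_string` are hypothetical, and it assumes the input stream holds only string content, so working out where strings start and end inside the XML file is left to the caller.

```cpp
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical helper: writes `ch` to `out`, replacing the five characters
// XML reserves with their predefined entities.
void put_escaped(std::ostream& out, char ch) {
    switch (ch) {
        case '&':  out << "&amp;";  break;
        case '<':  out << "&lt;";   break;
        case '>':  out << "&gt;";   break;
        case '"':  out << "&quot;"; break;
        case '\'': out << "&apos;"; break;
        default:   out.put(ch);     break;
    }
}

// Streams `in` to `out` one byte at a time, escaping as it goes.
// Assumption: `in` contains only string content, not markup, and nothing in
// it is already escaped (an existing "&amp;" would be escaped a second time).
void escape_stream(std::istream& in, std::ostream& out) {
    char ch;
    while (in.get(ch)) {
        put_escaped(out, ch);
    }
}

// Convenience wrapper for short in-memory strings.
std::string escape_string(const std::string& text) {
    std::istringstream in(text);
    std::ostringstream out;
    escape_stream(in, out);
    return out.str();
}
```

All five reserved characters are ASCII, and the bytes of a multi-byte UTF-8 sequence are never ASCII, so UTF-8 input passes through this loop unchanged. UTF-16 input would need to be decoded first.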

Styggentorsken
Thanks, I'll see if I can hack something together with this.
Grym
sed is not the appropriate tool for this.
This is just a file copy program, not an XML parser.
Jim Ferrans
I used this: I just detected whenever a special character appeared in a string and escaped it. The only problem is that now I need to get it to work with Unicode.
Grym
+1  A: 

SAX and StAX may work if what you need to do is very simple; otherwise, VTD-XML is the best bet.

Introduction to VTD-XML

vtd-xml-author
+1 for the interesting reference.
Jim Ferrans
Why not give an example of using VTD-XML to solve the problem?
John Saunders
Did you see the reference to an introductory article?
vtd-xml-author