A lightweight XML parser efficient for large files?

+5 Q:

I need to parse potentially huge XML files, so I guess this rules out DOM parsers.

Is there any good lightweight SAX parser for C++ out there, comparable to TinyXML in footprint? The structure of the XML is very simple; no advanced things like namespaces and DTDs are needed. Just elements, attributes and CDATA.

I know about Xerces, but its sheer size of over 50 MB gives me shivers.

Thanks!

+7 A:

If you are using C, then you can use LibXML from the GNOME project. You can choose between DOM and SAX interfaces to your document, plus lots of additional features that have been developed over the years. If you really want C++, then you can use libxml++, which is a C++ OO wrapper around LibXML.

The library has been proven again and again, is high performance, and can be compiled on almost any platform you can find.

Tony Miller 2009-06-17 11:59:15

Thanks for the answer. Is LibXML lightweight? How many kbytes does it add to the executable?

Alex Jenter 2009-06-17 12:06:33

If you're using a dynamic library (UNIX shared lib / Windows DLL), then the answer is "none". A quick check on my Linux box shows that the shared lib is 1.2M and the static library (the one you would link into your program) is 1.5M. So with a static build you'd be adding roughly 1.5M to your exe.

Tony Miller 2009-06-17 12:12:48

My whole .exe is around 350 KB, so I'd rather find something more lightweight, but thanks anyway.

Alex Jenter 2009-06-17 12:16:08

If you're truly worried about size, try Expat at http://expat.sourceforge.net/. Its shared library is 133K on my Linux box, and I'd guess a static build (.a) linked into your code would add about the same.

Tony Miller 2009-06-17 12:27:54

+1 A:

If your XML structure is very simple, you can consider building a simple lexer/scanner based on lex/yacc (flex/bison). The sources at the W3C may inspire you: http://www.w3.org/XML/9707/parser.y and http://www.w3.org/XML/9707/scanner.l.

See also the SAX2 interface in libxml
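
For a format limited to elements, attributes and character data, the scanner can also be written by hand instead of generated with flex/bison. Below is a rough sketch of that route, not a port of the W3C grammar files. It reads the stream once and fires SAX-style callbacks. `Handlers`, `scan` and `traceEvents` are names made up for this illustration, and the code assumes well-formed input with quoted `key="value"` attributes. Entities, DTDs, namespaces and encodings are not handled.

```cpp
#include <cassert>
#include <functional>
#include <istream>
#include <map>
#include <sstream>
#include <string>

using Attrs = std::map<std::string, std::string>;

// Hypothetical callback set, loosely modelled on SAX events.
struct Handlers {
    std::function<void(const std::string&, const Attrs&)> start;
    std::function<void(const std::string&)> end;
    std::function<void(const std::string&)> text;
};

// Consumes input through the terminator `stop`; returns what preceded it.
static std::string readUntil(std::istream& in, const std::string& stop) {
    std::string buf;
    for (char c; in.get(c);) {
        buf += c;
        if (buf.size() >= stop.size() &&
            buf.compare(buf.size() - stop.size(), stop.size(), stop) == 0)
            return buf.substr(0, buf.size() - stop.size());
    }
    return buf;  // input ended before the terminator
}

// Splits the inside of a tag into its name and key="value" pairs.
static void splitTag(const std::string& tag, std::string& name, Attrs& attrs) {
    const std::string ws = " \t\r\n";
    std::size_t i = tag.find_first_of(ws);
    name = tag.substr(0, i);
    while (i != std::string::npos) {
        std::size_t k = tag.find_first_not_of(ws, i);       // start of key
        std::size_t eq = tag.find('=', k);
        if (eq == std::string::npos || eq + 1 >= tag.size()) break;
        std::size_t close = tag.find(tag[eq + 1], eq + 2);  // matching quote
        if (close == std::string::npos) break;
        attrs[tag.substr(k, eq - k)] = tag.substr(eq + 2, close - eq - 2);
        i = close + 1;
    }
}

// Single forward pass over the stream. Memory use is bounded by the largest
// tag or text run, not by the document. Returns false if input ends in a tag.
bool scan(std::istream& in, const Handlers& h) {
    assert(h.start && h.end && h.text);  // all three callbacks must be set
    std::string text;
    char c;
    while (in.get(c)) {
        if (c != '<') { text += c; continue; }
        if (!text.empty()) { h.text(text); text.clear(); }
        std::string tag;
        char quote = 0;
        bool consumed = false, closed = false;
        while (!consumed && in.get(c)) {
            bool element = !tag.empty() && tag[0] != '!' && tag[0] != '?';
            if (quote) { if (c == quote) quote = 0; }
            else if (element && (c == '"' || c == '\'')) quote = c;
            else if (c == '>') { closed = true; break; }
            tag += c;
            if (tag == "!--") { readUntil(in, "-->"); consumed = true; }
            else if (tag == "![CDATA[") { h.text(readUntil(in, "]]>")); consumed = true; }
        }
        if (consumed) continue;            // comment or CDATA already handled
        if (!closed) return false;         // input ended inside a tag
        if (tag.empty() || tag[0] == '?' || tag[0] == '!') continue;  // ignored
        bool isEnd = tag[0] == '/', selfClosing = tag.back() == '/';
        if (isEnd) tag.erase(0, 1);
        else if (selfClosing) tag.pop_back();
        std::string name;
        Attrs attrs;
        splitTag(tag, name, attrs);
        if (!isEnd) h.start(name, attrs);
        if (isEnd || selfClosing) h.end(name);
    }
    if (!text.empty()) h.text(text);
    return true;
}

// Example use: render the event stream as text, one event per line.
std::string traceEvents(const std::string& xml) {
    std::istringstream in(xml);
    std::string out;
    Handlers h;
    h.start = [&](const std::string& n, const Attrs& a) {
        out += "start " + n;
        for (const auto& kv : a) out += " " + kv.first + "=" + kv.second;
        out += "\n";
    };
    h.end = [&](const std::string& n) { out += "end " + n + "\n"; };
    h.text = [&](const std::string& t) { out += "text " + t + "\n"; };
    return scan(in, h) ? out : out + "error\n";
}
```

For a real file you would pass a `std::ifstream` to `scan` and do the per-record work inside the callbacks. Whitespace between elements is reported as text, so the handlers have to ignore it if it is not wanted.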

Pierre 2009-06-17 12:01:13

+2 A:

I like Expat:
http://expat.sourceforge.net/

It is C based but there are several C++ wrappers around to help.

Martin York 2009-06-17 17:01:23

+2 A:

http://sourceforge.net/projects/wsdlpull is a straight C++ port of the Java XmlPull API (http://www.xmlpull.org/).

I would highly recommend this parser. I had to customize it for use on my embedded device (no STL support), but I have found it to be very fast with very little overhead. I had to write my own string and vector classes, and even with those it compiles to about 60 KB on Windows.

I think that pull parsing is a lot more intuitive than something like SAX. The code mirrors the XML document much more closely, making it easy to correlate the two.

The one downside is that it is forward only, meaning that you need to parse the elements as they come. We have a fairly messed up design for reading our config files, and I need to parse a whole subtree, make some checks, set some defaults, then parse it again. With this parser the only real way to handle something like that is to make a copy of the state, parse with that, then continue on with the original. It still ends up being a big win in terms of resources compared with our old DOM parser.
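
To show what "the code mirrors the document" looks like, here is a minimal pull-style sketch. It is not the wsdlpull or XmlPull API: `PullReader`, `next`, `attr` and `listRecords` are hypothetical names, and the `record` element with its `id` attribute is an invented document shape. It assumes well-formed input with `key="value"` attributes and skips entities, CDATA sections, DTDs and namespaces.

```cpp
#include <cassert>
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Hypothetical event-at-a-time reader; not the wsdlpull or XmlPull API.
class PullReader {
public:
    enum Event { StartTag, EndTag, Text, EndDocument };
    explicit PullReader(std::istream& in) : in_(in) {}
    // Advances to the next event; only the current event is held in memory.
    Event next() { return last_ = advance(); }
    const std::string& name() const { return name_; }  // after StartTag/EndTag
    const std::string& text() const { assert(last_ == Text); return text_; }
    std::string attr(const std::string& key) const {
        assert(last_ == StartTag);
        return attrs_.count(key) ? attrs_.at(key) : std::string();
    }

private:
    Event advance() {
        if (pendingEnd_) { pendingEnd_ = false; return EndTag; }  // closes <x/>
        for (char c; in_.get(c);) {
            if (c != '<') {  // character data up to the next '<'
                text_.assign(1, c);
                while (in_.peek() != '<' && in_.get(c)) text_ += c;
                if (text_.find_first_not_of(" \t\r\n") == std::string::npos) continue;
                return Text;
            }
            std::string tag = readMarkup();
            if (tag.empty() || tag[0] == '?' || tag[0] == '!') continue;  // skipped
            bool isEnd = tag[0] == '/';
            pendingEnd_ = !isEnd && tag.back() == '/';
            if (isEnd) tag.erase(0, 1);
            if (pendingEnd_) tag.pop_back();
            parse(tag);
            return isEnd ? EndTag : StartTag;
        }
        return EndDocument;
    }

    // Reads from just after '<' through the matching '>'; returns the inside.
    std::string readMarkup() {
        std::string tag;
        char c, quote = 0;
        while (in_.get(c)) {
            bool element = !tag.empty() && tag[0] != '!' && tag[0] != '?';
            if (quote) { if (c == quote) quote = 0; }
            else if (element && (c == '"' || c == '\'')) quote = c;
            else if (c == '>') {
                bool openComment = tag.compare(0, 3, "!--") == 0 &&
                    (tag.size() < 5 || tag.compare(tag.size() - 2, 2, "--") != 0);
                if (!openComment) return tag;
            }
            tag += c;
        }
        return std::string();  // input ended inside markup
    }
    // Splits a tag body into name_ and attrs_ (expects key="value" pairs).
    void parse(const std::string& tag) {
        const std::string ws = " \t\r\n";
        std::size_t i = tag.find_first_of(ws);
        name_ = tag.substr(0, i);
        attrs_.clear();
        while (i != std::string::npos) {
            std::size_t k = tag.find_first_not_of(ws, i);       // start of key
            std::size_t eq = tag.find('=', k);
            if (eq == std::string::npos || eq + 1 >= tag.size()) break;
            std::size_t close = tag.find(tag[eq + 1], eq + 2);  // matching quote
            if (close == std::string::npos) break;
            attrs_[tag.substr(k, eq - k)] = tag.substr(eq + 2, close - eq - 2);
            i = close + 1;
        }
    }
    std::istream& in_;
    std::string name_, text_;
    std::map<std::string, std::string> attrs_;
    Event last_ = EndDocument;
    bool pendingEnd_ = false;
};

// Example consumer: the loops follow the nesting of a hypothetical document.
std::string listRecords(std::istream& in) {
    PullReader xml(in);
    std::string out;
    for (auto e = xml.next(); e != PullReader::EndDocument; e = xml.next()) {
        if (e != PullReader::StartTag || xml.name() != "record") continue;
        out += xml.attr("id") + ":";
        while ((e = xml.next()) != PullReader::EndDocument &&
               !(e == PullReader::EndTag && xml.name() == "record"))
            if (e == PullReader::Text) out += xml.text();
        out += "\n";
    }
    return out;
}
```

The forward-only limitation is visible here too: once `next()` has moved past a subtree, nothing is kept that would let you revisit it. A second pass means re-reading the stream or buffering the subtree yourself.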

Dolphin 2009-06-17 18:37:27

As far as I can tell it doesn't support Unicode?

Alex Jenter 2009-06-18 02:54:00

It parses a character at a time and uses an int for each character. For element and attribute names it has a rather restrictive definition of a valid identifier (basically ASCII), but it probably wouldn't take much to change that. It comes with a project that runs a parse/serialize test, so it is pretty easy to try it out on some representative data.

Dolphin 2009-06-18 03:12:01

Thanks for the answer, I'll look into it

Alex Jenter 2009-06-22 12:27:51

I'd look at tools that generate a DTD/Schema-specific parser if you want small and fast. These are very good for huge documents.

Ira Baxter 2009-09-04 03:46:32

firstobject's CMarkup is a C++ class that works as a lightweight pull parser for huge files (I recommend a pull parser rather than SAX), and as a huge XML file writer too. It adds up to about 250 KB to your executable. Used in memory, it has about a third of TinyXML's footprint according to one user's report. When used on a huge file it only holds a small buffer (around 16 KB) in memory. CMarkup is currently a commercial product, so it is supported, documented, and designed to be easy to add to your project as a single .cpp and .h file.

The easiest way to try it out is with a script in the free firstobject XML editor such as this:

ParseHugeXmlFile()
{
  CMarkup xml;
  xml.Open( "HugeFile.xml", MDF_READFILE );
  while ( xml.FindElem("//record") )
  {
    // process record...
    str sRecordId = xml.GetAttrib( "id" );
    xml.IntoElem();
    xml.FindElem( "description" );
    str sDescription = xml.GetData();
  }
  xml.Close();
}

From the File menu, select New Program, paste this in and modify it for your elements and attributes. Then press F9 to run it or F10 to step through it line by line.

Ben Bryant 2009-09-28 17:03:19

+1 A:

RapidXML is quite a fast parser for XML written in C++.

dtw 2010-01-23 21:44:14
