Hello

What XML parser do you recommend for the following purpose:

The XML file (formatted, with whitespace) is around 800 MB. It mostly contains three types of tags (let's call them n, w and r). They have an attribute called id which I'd have to search for, as fast as possible.

Removing attributes I don't need could save around 30%, maybe a bit more.

First part, for optimizing the second part: Is there any good tool (command-line, for Linux and Windows if possible) to easily remove unused attributes from certain tags? I know that XSLT could be used. Or are there any easy alternatives? Also, I could split it into three files, one for each tag, to gain speed for later parsing... Speed is not too important for this preparation of the data; of course it would be nice if it took minutes rather than hours.

Second part: Once I have the data prepared, be it shortened or not, I should be able to search for the id attribute I mentioned; this is time-critical.

Estimates using wc -l tell me that there are around 3M n-tags and around 418K w-tags. The latter can each contain up to approximately 20 subtags. The w-tags also contain some, but those would be stripped away.

"All I have to do" is navigating between tags containing certain id-attributes. Some tags have references to other id's, therefore giving me a tree, maybe even a graph. The original data is big (as mentioned), but the resultset shouldn't be too big as I only have to pick out certain elements.

Now the question: What XML parsing library should I use for this kind of processing? I would use Java 6 in a first instance, keeping in mind that I'd be porting it to BlackBerry.

Might it be useful to just create a flat file indexing the ids and pointing to an offset in the file? Is it even necessary to do the optimizations mentioned above? Or are there parsers known to be just as fast on the original data?
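
To make the flat-file idea concrete, here is roughly the kind of index I have in mind - just a sketch, assuming every id appears as id="..." and a byte offset per line is enough:

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class OffsetIndex {
        // Matches id="..." inside a tag; assumes ids contain no quotes.
        private static final Pattern ID = Pattern.compile("id=\"([^\"]+)\"");

        // One-off pass: record the byte offset of the line holding each id.
        // readLine() on RandomAccessFile is unbuffered and slow, but this
        // preparation step is not time-critical.
        public static Map<String, Long> build(String path) throws IOException {
            Map<String, Long> index = new HashMap<String, Long>();
            RandomAccessFile raf = new RandomAccessFile(path, "r");
            try {
                long offset = raf.getFilePointer();
                String line;
                while ((line = raf.readLine()) != null) {
                    Matcher m = ID.matcher(line);
                    while (m.find()) {
                        index.put(m.group(1), offset);
                    }
                    offset = raf.getFilePointer();
                }
            } finally {
                raf.close();
            }
            return index;
        }

        // Time-critical lookup: seek straight to the recorded offset.
        public static String lineFor(String path, long offset) throws IOException {
            RandomAccessFile raf = new RandomAccessFile(path, "r");
            try {
                raf.seek(offset);
                return raf.readLine();
            } finally {
                raf.close();
            }
        }
    }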

Little note: To test, I took the id on the very last line of the file and searched for it using grep. This took around a minute on a Core 2 Duo.

What happens if the file grows even bigger, let's say 5 GB?

I appreciate any advice or recommendation. Thank you all very much in advance, and regards.

+1  A: 

I'm using XMLStarlet ( http://xmlstar.sourceforge.net/ ) for working with huge XML files. There are versions for both Linux and Windows.
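
For example, dropping an attribute you don't need from every w tag looks roughly like this (the attribute name foo is just a placeholder, and depending on the build the binary may be called xml instead of xmlstarlet):

    xmlstarlet ed -d '//w/@foo' input.xml > output.xml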

Kirzilla
+1 for the good tip. I couldn't take a deep look at it, but at a glance it looks quite nice and simple if you know XPath - which I will have to deal with anyway.
Atmocreations
Not too good a tip for big data, though :P. I tried to delete some attributes in this big file on Windows 7, and at some point it told me it couldn't continue because it was out of memory.
Atmocreations
Thank you for the information. Then splitting the file will probably be the best approach. Please, could you comment here on your further steps? I'm also really interested in processing huge XML files. Thank you.
Kirzilla
I will document my different steps and their success, yes. That's the only reason why I haven't accepted an answer yet.
Atmocreations
A: 

Large XML files and Java heap space are a vexed issue. StAX works on big files - it certainly handles 1 GB without batting an eyelid. There's a useful article on the subject of using StAX on XML.com, which got me up and running with it in about 20 minutes.
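
A minimal sketch of the cursor-style scan StAX gives you, using the n/w/r element names and the id attribute from the question:

    import java.io.FileInputStream;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamReader;

    public class StaxIdScan {
        public static void main(String[] args) throws Exception {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            XMLStreamReader reader =
                    factory.createXMLStreamReader(new FileInputStream(args[0]));
            String wanted = args[1]; // the id to search for
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    if ("n".equals(name) || "w".equals(name) || "r".equals(name)) {
                        if (wanted.equals(reader.getAttributeValue(null, "id"))) {
                            System.out.println("Found " + name + " with id " + wanted);
                            break; // stop as soon as the element is found
                        }
                    }
                }
            }
            reader.close();
        }
    }

Because the cursor only ever holds the current event, memory stays flat no matter how large the file gets.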

Szyzygy
+1  A: 

What XML parser do you recommend for the following purpose: The XML file (formatted, with whitespace) is around 800 MB.

Perhaps you should take a look at VTD-XML: http://en.wikipedia.org/wiki/VTD-XML (see http://sourceforge.net/projects/vtd-xml/ for download)

It mostly contains three types of tags (let's call them n, w and r). They have an attribute called id which I'd have to search for, as fast as possible.

I know it's blasphemy, but have you considered awk or grep for preprocessing? I mean, I know you can't actually parse nested structures like XML with those and detect errors, but perhaps your XML is in such a form that it just happens to be possible?

I know that XSLT could be used. Or are there any easy alternatives?

As far as I know, XSLT processors operate on a DOM tree of the source document... so they'd need to parse and load the entire document into memory... probably not a good idea for a document this large (or perhaps you have enough memory for that?). There is something called streaming XSLT, but the technique is quite young and there aren't many implementations around, none of them free AFAIK - but you could try.

Roland Bouman
XSLT doesn't do anything with the DOM except when it's in a browser. It only creates a new document.
Rob
@Rob can you clarify your comment please? If XSLT does not operate on the DOM tree, how exactly does it work? Thanks
e4c5
Rob, from the XSLT spec: "A transformation expressed in XSLT describes rules for transforming a source tree into a result tree." Now, I realize that implementations are free to do this in any which way they like, but AFAIK, they do it by first parsing the source to a tree. Whether that parse tree does or does not conform to a full-blown DOM, doesn't really matter - with these file sizes it's going to take a huge glob of memory. Hope I clarified that.
Roland Bouman
No blasphemy, Roland - good idea to use grep etc. I guess I will write something of my own, because I need something to distribute anyway, so that users can adapt their own piece from the source data.
Atmocreations
@Rob: DOM = in-memory representation and API for XML: org.w3c.dom.*. Alternatives: SAX, VTD-XML. It's NOT referring to the element tree of the browser!
helios
+4  A: 

As Bouman has pointed out, treating this as pure text processing will give you the best possible speed.

To process this as XML, the only practical way is to use a SAX parser. The Java API's built-in SAX parser is perfectly capable of handling this, so there is no need to install any third-party libraries.
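
A minimal sketch with the JDK's built-in parser, again using the id attribute from the question (in real code you would typically throw an exception from the handler to abort parsing once the element is found, since SAX has no other way to stop early):

    import java.io.File;
    import javax.xml.parsers.SAXParser;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.Attributes;
    import org.xml.sax.helpers.DefaultHandler;

    public class SaxIdScan {
        public static void main(String[] args) throws Exception {
            final String wanted = args[1]; // the id to search for
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new File(args[0]), new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes attributes) {
                    // Without namespace awareness, qName holds the element name.
                    if (wanted.equals(attributes.getValue("id"))) {
                        System.out.println("Found <" + qName + "> with id " + wanted);
                    }
                }
            });
        }
    }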

e4c5
Yeah, +1. SAX would guarantee a single pass through the document. So if you are sure you can do all the bookkeeping and manipulation you need in a single pass, you could certainly give that a try.
Roland Bouman
Although the problem is not completely solved yet, this is the answer that helped me the most. Thanks...
Atmocreations
Glad I could help. If you need any additional input, I will be happy to help.
e4c5
A: 

XSLT tends to be comparatively quite fast, even for large files. For large files, the trick is not creating the DOM first: pass a URL source or a stream source to the transformer.

To strip the empty nodes and unwanted attributes, start with the identity transform template and filter them out. Then use XPath to search for your required tags.
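
A sketch of that setup, with the stylesheet inlined as a string: the classic identity transform plus one empty template that drops a placeholder attribute called junk. (Note Roland's caveat in the comments: the processor may still build an internal tree, so memory use is not guaranteed to stay flat.)

    import java.io.File;
    import java.io.StringReader;
    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.stream.StreamResult;
    import javax.xml.transform.stream.StreamSource;

    public class StripAttributes {
        // Identity transform: copies everything except the @junk attribute,
        // which matches the empty template and is silently dropped.
        private static final String XSLT =
            "<xsl:stylesheet version='1.0'"
          + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
          + "  <xsl:template match='@*|node()'>"
          + "    <xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
          + "  </xsl:template>"
          + "  <xsl:template match='@junk'/>"
          + "</xsl:stylesheet>";

        public static void main(String[] args) throws Exception {
            Transformer t = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(XSLT)));
            // StreamSource instead of a DOMSource: the transformer reads from
            // the stream itself, so the caller never builds a DOM.
            t.transform(new StreamSource(new File(args[0])),
                        new StreamResult(new File(args[1])));
        }
    }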

You could also try a bunch of variations:

  • Split the large XML file into smaller ones while still preserving their composition using XInclude. It is very similar to splitting large source files into smaller ones and using the include "x.h" kind of concept. This way, you may not have to deal with large files.

  • When you run your XML through the identity transform, use it to assign a UNID to each node of interest using the generate-id() function.

  • Build a front-end database table for searching. Use the above generated UNID to quickly pinpoint the location of the data in a file.

srini.venigalla
Here is the identity transform: http://en.wikipedia.org/wiki/Identity_transform. In its simplest form it just makes a copy of the source file, but you can tweak it to do some amazing things very fast. For example, you can use it to split the files, assign UNIDs, strip unwanted nodes/attributes, add markers, etc.
srini.venigalla
srini, it is nice and dandy that you can pass a stream to the XSLT processor, but doesn't it internally create a parse tree/DOM anyway? AFAIK this is what they have to do - even a 'streaming' XSLT implementation, or actually anything that relies on XPath (XSLT, XQuery), would still need to lazily parse the document, which for many XPath expressions still means reading the entire document.
Roland Bouman
Roland, agreed, but I would like to leave that step to the transformer as well. The reason being, DOM as we see it is a heavy structure, as opposed to what is needed for mere traversal. A good XSLT processor may use a lighter-weight tree.
srini.venigalla
srini: ok, I see what you mean now. Thanks for pointing it out.
Roland Bouman
A: 

"I could split it into three files"

Try XmlSplit. It is a command-line program with options for specifying where to split by element, attribute, etc. Google it and you should find it. Very fast, too.

bill seacham