This XML file contained archived news stories for all of last year. I was asked to sort these stories by story category (or categories) into new XML files.

big_story_export.xml

turns into

lifestyles.xml
food.xml
nascar.xml

...and so on.

I got the job done using a one-off Python script; however, I originally attempted this using XSLT. This resulted in frustration, as my XPath selections were crapping the bed. Test files were transformed perfectly, but putting the big file up against my stylesheet resulted in... nothing.

What strategies do you recommend for ensuring that files like this will run through XSLT? This was handed to me by a vendor, so imagine that I don't have a lot of leverage when it comes to defining the structure of this file.

If you guys want code samples, I'll put some together.

If anything, I'd be satisfied with some tips for making XML+XSLT work together smoothly.


@Sklivvz

I was using Python's libxml2 & libxslt bindings to process this. I'm looking into xsltproc now.

It seems like a good tool for these one-off situations. Thanks!


@diomidis-spinellis

It's well-formed, though (as mentioned) I don't have the facilities to check its validity.

As for writing a Schema, I like the idea.

The amount of time I invest in getting this one file validated would be impractical if it were a one-time thing, though I foresee having to handle more files like this from our vendor.

Writing a schema (and submitting it to the vendor) would be an excellent long-term strategy for managing XML funk like this. Thanks!

+2  A: 

What language/parser were you using?
For large files I try to use Unix command line tools.
They are usually much, much more efficient than other solutions and don't "crap out" on large files.

Try using xsltproc
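
If you want to keep driving things from Python, here is a minimal sketch of calling xsltproc as an external process. The file names are hypothetical placeholders, and it assumes xsltproc is installed and on your PATH; `-o` names the output file, followed by the stylesheet and then the source document.

```python
import shutil
import subprocess


def xsltproc_command(stylesheet, source, output):
    """Build the argument list for: xsltproc -o OUTPUT STYLESHEET SOURCE."""
    return ["xsltproc", "-o", output, stylesheet, source]


def run_xsltproc(stylesheet, source, output):
    """Run xsltproc and raise if it is missing or exits with an error."""
    if shutil.which("xsltproc") is None:
        raise RuntimeError("xsltproc was not found on PATH")
    subprocess.run(xsltproc_command(stylesheet, source, output), check=True)


# Hypothetical usage (file names are placeholders):
# run_xsltproc("split_food.xsl", "big_story_export.xml", "food.xml")
```

Running it as a separate process also means a crash or memory blow-up in the transform does not take your script down with it.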

Sklivvz
I second the recommendation for xsltproc. It is worth a try, since the Python XSLT doesn't handle it.
DGentry
+2  A: 

This sounds like a bug in the large XML file or the XSLT processor. There are two things you should check on your file.

  1. Is the file well-formed XML? That is, are all tags and attributes properly terminated and matched? An XML processor, like xmlstarlet, can tell you that.
  2. Does the file contain valid XML? For this you need a schema and an XML validator (xmlstarlet can do this trick as well). I suggest you invest some effort in writing the schema definition of your file. It will simplify your debugging a lot, because you can then easily pinpoint the exact source of any problems you may be having.
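
If xmlstarlet is not at hand, the first check can be approximated from Python. This is only a sketch using the standard library's `xml.sax` module instead of xmlstarlet; the function name `check_well_formed` is made up for illustration. It streams the file rather than loading it, and it tests well-formedness only, not validity against a schema.

```python
import xml.sax


def check_well_formed(path):
    """Return None if the file parses cleanly, else a short error message.

    The parser streams the file, so memory use stays small even for
    very large documents.
    """
    try:
        # A bare ContentHandler ignores every event; we only care about errors.
        xml.sax.parse(path, xml.sax.ContentHandler())
    except xml.sax.SAXParseException as err:
        return "line {}, column {}: {}".format(
            err.getLineNumber(), err.getColumnNumber(), err.getMessage()
        )
    return None


# Hypothetical usage:
# problem = check_well_formed("big_story_export.xml")
# print(problem or "well-formed")
```

The line and column in the message point at where the parser gave up, which is usually at or just after the offending tag.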

If the file is well-formed and valid, but the XSLT processor still refuses to give you the results you would expect, you can be sure that the problem lies in the processor, and you should try a different one.

Diomidis Spinellis
+2  A: 

Can I recommend the Saxon XSLT processor? I know for a fact it can handle large files, provided you give the JVM enough memory.

Another thing is that there may be optimisations in your XSLT that could help, but it's hard to make blanket statements about things like that.

samjudson
+4  A: 

The problem with using XSLT to process arbitrarily large XML documents is that XSLT processing begins by parsing the input document into a source tree, and that whole tree is held in memory. This means that eventually you'll encounter an input document large enough to cause problems, even if you're using a robust XSLT processor like Saxon and you have plenty of virtual memory. (It may still work, but it'll be slow.)

Another reason not to use XSLT for this is that you're producing multiple output documents, which (based on what you've said so far) means you're making multiple passes over your input document.

It may (depending on a lot of factors about your situation that I don't know about) be better to take a SAX-based approach instead of using XSLT. Using a SAX processor, you may be able to write a method that makes a single, forward-only pass through the source document, parsing it as it goes, and writes all of the output documents as it encounters the elements that contain them.
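
As a rough illustration of that single-pass idea, here is a sketch using Python's standard `xml.sax` module. The element names `story` and `category`, the `<stories>` wrapper, and the class and function names are all assumptions made for the example, since the real vendor format isn't shown. It buffers one story at a time and appends it to the output file of each category it lists.

```python
import io
import os
import xml.sax
from xml.sax.saxutils import XMLGenerator


class StorySplitter(xml.sax.ContentHandler):
    """Buffer one <story> at a time, then append it to each category file."""

    def __init__(self, out_dir):
        super().__init__()
        self.out_dir = out_dir
        self.outputs = {}     # category key -> open output file
        self.buffer = None    # StringIO holding the current story
        self.writer = None    # XMLGenerator serialising into the buffer
        self.categories = []  # category keys seen in the current story
        self.text = None      # text chunks while inside <category>

    def startElement(self, name, attrs):
        if name == "story":
            self.buffer = io.StringIO()
            self.writer = XMLGenerator(self.buffer, encoding="utf-8")
            self.categories = []
        if self.writer is not None:
            self.writer.startElement(name, attrs)
            if name == "category":
                self.text = []

    def characters(self, content):
        if self.writer is not None:
            self.writer.characters(content)
        if self.text is not None:
            self.text.append(content)

    def endElement(self, name):
        if self.writer is None:
            return  # outside any story: nothing to copy
        self.writer.endElement(name)
        if name == "category":
            key = "".join(self.text).strip().lower()
            if key and key not in self.categories:
                self.categories.append(key)
            self.text = None
        elif name == "story":
            story = self.buffer.getvalue()
            for key in self.categories:
                self._output_for(key).write(story + "\n")
            self.buffer = self.writer = None

    def _output_for(self, key):
        # Assumes category names are safe to use as file names.
        if key not in self.outputs:
            path = os.path.join(self.out_dir, key + ".xml")
            handle = open(path, "w", encoding="utf-8")
            handle.write("<stories>\n")
            self.outputs[key] = handle
        return self.outputs[key]

    def endDocument(self):
        for handle in self.outputs.values():
            handle.write("</stories>\n")
            handle.close()


def split_by_category(source_path, out_dir):
    """Make one forward-only pass over source_path, writing one file per category."""
    xml.sax.parse(source_path, StorySplitter(out_dir))
```

Memory use is bounded by the largest single story rather than by the size of the whole export, and every output file is produced in the same pass.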

Robert Rossney
Would upvote twice if I could :)
Constantin
A: 

Check out Apache's Xalan C++. In my experience, where others (including Saxon) have failed on "large" XML files (>600 MB), this was able to run with memory to spare.

fatcat1111