Hi everyone, I have a gigantic (4GB) XML file that I am currently breaking into chunks with the Linux "split" command (every 25,000 lines, not by bytes). This usually works great (I end up with about 50 files), except that some of the data descriptions contain line breaks, so the chunk files frequently lack the proper closing tags and my parser chokes halfway through processing.

Example file: (note: normally each "listing" xml node is supposed to be on its own line)

<?xml version="1.0" encoding="UTF-8"?>
<listings>
<listing><date>2009-09-22</date><desc>This is a description WITHOUT line breaks and works fine with split</desc><more_tags>stuff</more_tags></listing>
<listing><date>2009-09-22</date><desc>This is a really
annoying description field
WITH line breaks 
that screw the split function</desc><more_tags>stuff</more_tags></listing>
</listings>

Then sometimes my split ends up like

<?xml version="1.0" encoding="UTF-8"?>
<listings>
<listing><date>2009-09-22</date><desc>This is a description WITHOUT line breaks and works fine with split</desc><more_tags>stuff</more_tags></listing>
<listing><date>2009-09-22</date><desc>This is a really
annoying description field
WITH line breaks ... 
EOF

So - I have been reading about "csplit" and it sounds like it might solve this issue. I can't seem to get the regular expression right...

Basically I want the same output of ~50ish files

Something like:

csplit -k myfile.xml '/</listing>/' 25000 {50}

Any help would be great. Thanks!

+2  A: 

You can't get a valid XML file this way. I would recommend that you write a Java program using StAX, which, if you use the Woodstox implementation, will stream the XML in and out really quite fast.

bmargulies
Thanks for the suggestion - however, on my chunks I add in the first line and/or the last line to close the "listings" main tag so each sub file is valid. I have been doing this for some time now and it works fine to add those lines after the split.
Fred
If you are familiar with Java, extended VTD-XML gives you the flexibility to split the document using XPath; see the example at http://snippets.dzone.com/posts/show/11269
vtd-xml-author
+1  A: 

Use perl:

perl -n -e 'unless (defined $fname) { $fname = "xx00"; open $fh, ">", $fname; } $size += length; print $fh $_; if ($size > %MAX% and m@</listing>@) { $fname++; $size = 0; open $fh, ">", $fname; }' myfile.xml

Replace %MAX% with maximum size of one file in bytes.
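
For comparison, here is a rough awk sketch of the same idea that counts lines instead of bytes, which is closer to the "every 25,000 lines" in the question. It is only an illustration under assumptions: the sample file contents, the chunk_NN output names and the tiny max=2 threshold are all made up for the demo, and like the Perl one-liner above it relies on plain text matching of </listing>.

```shell
# Work in a scratch directory so the chunk files do not clutter anything.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# A tiny stand-in for the real file (contents are made up for illustration).
cat > myfile.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<listings>
<listing><desc>first</desc></listing>
<listing><desc>second, with a
line break</desc></listing>
<listing><desc>third</desc></listing>
</listings>
EOF

# max is the minimum number of lines per chunk (25000 in the question, 2 here).
# chunk_NN is a hypothetical naming scheme chosen for this sketch.
awk -v max=2 '
  BEGIN { out = sprintf("chunk_%02d", n) }
  { print > out; lines++ }
  # Switch to the next file only after a line that closes a listing.
  lines >= max && /<\/listing>/ {
    close(out); n++; lines = 0
    out = sprintf("chunk_%02d", n)
  }
' myfile.xml

ls chunk_*
```

The chunks still lack the XML declaration and the enclosing listings tags, so the header and footer lines would have to be added afterwards, as described in the comments above.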

ZyX
How is this better than csplit?
bmargulies
Would this work with CDATA sections? In general I don't think it's a good idea to try to modify XML with non-XML tools, so using a Perl-based (etc.) XML parser would make more sense IMO. A bit more work, but it would actually work.
StaxMan
A: 

First of all, you use a slash inside the regexp. To be safe you might want to escape it with a backslash so that it won't be confused with the end delimiter: /<\/listing>/.

However, in this case it would be more convenient to split on the start tag rather than end tag, since each chunk contains up to but not including the matching line. So you might try something like this:

csplit myfile.xml '/^<listing>/' '{*}'

I used the beginning-of-line anchor ^ there to make sure it only splits before lines where the start tag appears at the beginning of the line.
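
To see what that command produces, here is a small sketch run against a made-up sample file (the contents are invented for illustration). It assumes GNU csplit, since '{*}' is a GNU extension, and adds -s only to suppress the byte counts csplit normally prints.

```shell
# Work in a scratch directory so the xx* output files do not clutter anything.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# A tiny stand-in for the real 4GB file (contents are made up for illustration).
cat > myfile.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<listings>
<listing><date>2009-09-22</date><desc>one line</desc></listing>
<listing><date>2009-09-22</date><desc>spans
two lines</desc></listing>
</listings>
EOF

# Split before every line that starts with <listing>.
csplit -s myfile.xml '/^<listing>/' '{*}'

# Count the pieces: xx00 holds the prolog, xx01 and xx02 hold one listing
# each, and the last piece also carries the closing </listings> line.
set -- xx*
pieces=$#
echo "pieces: $pieces"
```

Note that this yields one file per listing rather than roughly 25,000 lines per file, so on the real input the pieces would still have to be regrouped into larger chunks.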

Jukka Matilainen
A: 
Fred
+1  A: 

I would recommend against trying to use regexps (or naive text matching) for any XML manipulation, including splitting. XML is tricky enough to deal with that a parser should be used and, due to memory limitations, one that can do "streaming" (aka incremental / chunked) parsing. I am most familiar with Java, where you would use a StAX (or SAX) parser and writer/generator to do this; most other languages have something similar. Or, if the input is regular enough, a data binding tool (JAXB) that can bind subtrees.

Doing it the right way may be a bit more work, but it would actually work, dealing with things XML can contain (for example, CDATA sections cannot be split; regexp solutions invariably have cases they don't handle, until one has basically written a full XML parser).
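
As a small illustration of the CDATA problem, consider a made-up file with a single listing whose description happens to contain the closing tag as literal text. The file name and contents are invented for this sketch; the point is only that a text match cannot tell markup from character data.

```shell
# Work in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# One listing whose CDATA description mentions </listing> as plain text.
cat > tricky.xml <<'EOF'
<listings>
<listing><desc><![CDATA[ends with
</listing> as literal text
]]></desc></listing>
</listings>
EOF

# Count the lines a text-based splitter would treat as "end of a listing".
matches=$(grep -c '</listing>' tricky.xml)
echo "lines matching </listing>: $matches"
```

The document contains one listing, yet two lines match, and a splitter keyed on that pattern could cut in the middle of the CDATA section.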

StaxMan
A: 

Having run into the same requirement (to split a big XML file on the closure of top-level child elements, but in chunks), I don't think csplit can achieve this if it only works as described in its man page.

To be able to do this it would need..

  1. The ability to group patterns and repeat a group, not just a single pattern
  2. The ability to have a pattern that captured but did not split off a new file

That would enable a group like

tail bigfile.xml -n-1 | head -n+1 | csplit - '{ 25000 /<\/end>/ }' {*} 

I see neither of these features described in its man page (but think they would be useful additions).

Adrian