ansaurus

Question

How to use linux csplit to chop up massive XML file?

Answer 1

+2 A:

You can't get a valid XML file this way. I would recommend that you write a Java program using StAX, which, if you use the Woodstox implementation, will stream the XML in and out quite fast.

bmargulies 2010-05-13 22:31:17

Thanks for the suggestion. However, on my chunks I add the first line and/or the last line to open and close the "listings" main tag, so each sub-file is valid. I have been doing this for some time now, and adding those lines after the split works fine.
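
As an illustration of that post-processing step, here is a minimal sketch. It assumes the root element is `<listings>`, that the chunks carry csplit's default names (xx00, xx01, ...), and that the first chunk still holds the original header while the last one holds the original closing tag. The sample chunks, the XML declaration, and the `.xml` output names are made up for the demo.

```shell
# Work in a scratch directory so the demo does not touch real files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical chunks, shaped the way a split might leave them.
printf '%s\n' '<?xml version="1.0"?>' '<listings>' '<listing>a</listing>' > xx00
printf '%s\n' '<listing>b</listing>' > xx01
printf '%s\n' '<listing>c</listing>' '</listings>' > xx02

for f in xx*; do
  {
    # Prepend the header unless this chunk already contains the root start tag.
    grep -q '<listings>' "$f" || printf '%s\n' '<?xml version="1.0"?>' '<listings>'
    cat "$f"
    # Append the root end tag unless this chunk already contains it.
    grep -q '</listings>' "$f" || printf '%s\n' '</listings>'
  } > "$f.xml"
done
```

The `grep` checks are plain text matching, so this only holds up if the strings `<listings>` and `</listings>` never appear inside element content.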

Fred 2010-05-13 22:38:24

If you are familiar with Java, extended VTD-XML gives you the flexibility to split the document using XPath; see the example at http://snippets.dzone.com/posts/show/11269

vtd-xml-author 2010-05-14 02:07:39

Answer 2

+1 A:

Use Perl:

perl -n -e 'unless (defined $fname) { $fname = "xx00"; open $fh, ">", $fname; } $size += length; print $fh $_; if ($size > %MAX% and m@</listing>@) { $fname++; $size = 0; open $fh, ">", $fname; }' myfile.xml

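Replace %MAX% with the maximum size of one file in bytes. A new file is started at the first line containing </listing> after that size has been exceeded, so chunks can run slightly over the limit.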

ZyX 2010-05-13 22:59:24

How is this better than csplit?

bmargulies 2010-05-13 23:03:47

Would this work with CDATA sections? In general I don't think it's a good idea to try to modify XML with non-XML tools, so using a Perl-based (or similar) XML parser would make more sense IMO. It is a bit more work, but it would actually work.

StaxMan 2010-07-10 16:08:23

Answer 3

A:

First of all, you use a slash inside the regexp. To be safe you might want to escape it so that it won't be confused with the end delimiter: /<\/listing>/.

However, in this case it would be more convenient to split on the start tag rather than the end tag, since each chunk contains everything up to but not including the matching line. So you might try something like this:

csplit myfile.xml '/^<listing>/' '{*}'

I used the beginning-of-line anchor ^ there to make sure it only splits before lines where the start tag appears at the beginning of the line.
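
To see what that produces, here is a small self-contained sketch. The sample file contents and the `chunk_` prefix are made up for the demo, and the input is assumed to have each `<listing>` start tag at the beginning of its own line. Note that `{*}` is a GNU csplit extension.

```shell
# Work in a scratch directory so the demo leaves the current directory alone.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample input: one tag per line, <listing> at column 0.
cat > myfile.xml <<'EOF'
<?xml version="1.0"?>
<listings>
<listing>
<id>1</id>
</listing>
<listing>
<id>2</id>
</listing>
<listing>
<id>3</id>
</listing>
</listings>
EOF

# -s suppresses the byte counts; -f and -n name the pieces chunk_000, chunk_001, ...
# '{*}' repeats the pattern for as long as the input keeps matching.
csplit -s -f chunk_ -n 3 myfile.xml '/^<listing>/' '{*}'

ls chunk_*
```

With this input the first piece holds only the header lines before the first `<listing>`, and the last piece also carries the closing `</listings>` line, so neither is a complete document on its own.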

Jukka Matilainen 2010-05-15 07:33:09

Answer 4

A:

Fred 2010-05-15 22:37:26

Answer 5

+1 A:

I would recommend against trying to use regexps (or naive text matching) for any XML manipulation, including splitting. XML is tricky enough to deal with that a parser should be used and, due to memory limitations, one that can do "streaming" (aka incremental / chunked) parsing. I am most familiar with Java, where you would use a StAX (or SAX) parser and writer/generator to do this; most other languages have something similar. Or, if the input is regular enough, a data binding tool (JAXB) that can bind subtrees.

Doing it the right way may be a bit more work, but it would actually work and would deal with the things XML can contain. For example, CDATA sections cannot be split, and regexp solutions invariably have cases they don't handle until one has basically written a full XML parser.

StaxMan 2010-07-10 16:12:57

Answer 6

A:

Having run into the same requirement (to split a big XML file on the closure of top-level child elements, but in chunks), I don't think csplit can achieve this if it only works as described in its man page.

To be able to do this, it would need:

The ability to group patterns and repeat a group, not just a single pattern
The ability to have a pattern that captured but did not split off a new file

That would enable a (hypothetical) grouped pattern like

tail bigfile.xml -n-1 | head -n+1 | csplit - '{ 25000 /<\/end>/ }' {*}

I see neither of these features described in its man page (but I think they would be useful additions).
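
For comparison, the grouping can be approximated outside csplit. The following is only a sketch that swaps in sed and awk: it assumes every record ends with `</end>` on a line of its own, that the root tags occupy exactly the first and last lines, and it uses a group size of 2 instead of 25000 so the demo stays small. The sample file and the `part_` names are made up.

```shell
# Work in a scratch directory so the demo does not touch real files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical input: five records, each closed by </end> on its own line.
{
  echo '<root>'
  for i in 1 2 3 4 5; do
    printf '<end>\n<id>%s</id>\n</end>\n' "$i"
  done
  echo '</root>'
} > bigfile.xml

# Drop the first and last lines (the root tags), then start a new output
# file after every per_file-th closing tag.
sed '1d;$d' bigfile.xml |
awk -v per_file=2 '
  BEGIN { out = sprintf("part_%03d", part) }   # part starts at 0
  { print > out }                               # copy every line to the current part
  /<\/end>/ && ++count % per_file == 0 {        # count closing tags; rotate on every Nth
    close(out)
    out = sprintf("part_%03d", ++part)
  }
'

ls part_*
```

This is still line-oriented text matching, so it shares the limitations of any non-XML tool: a `</end>` inside a CDATA section or a comment would be counted as well.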

Adrian 2010-10-19 15:12:11
