Hi everyone, I have a gigantic (4GB) XML file that I am currently breaking into chunks with the Linux "split" command (every 25,000 lines, not by bytes). This usually works great (I end up with about 50 files), except some of the data descriptions contain line breaks, so the chunk files frequently end in the middle of a listing without the proper closing tags, and my parser chokes halfway through processing.
Example file: (note: normally each "listing" XML node is supposed to be on its own line)
<?xml version="1.0" encoding="UTF-8"?>
<listings>
<listing><date>2009-09-22</date><desc>This is a description WITHOUT line breaks and works fine with split</desc><more_tags>stuff</more_tags></listing>
<listing><date>2009-09-22</date><desc>This is a really
annoying description field
WITH line breaks
that screw the split function</desc><more_tags>stuff</more_tags></listing>
</listings>
Then sometimes one of the chunks produced by split ends up like this:
<?xml version="1.0" encoding="UTF-8"?>
<listings>
<listing><date>2009-09-22</date><desc>This is a description WITHOUT line breaks and works fine with split</desc><more_tags>stuff</more_tags></listing>
<listing><date>2009-09-22</date><desc>This is a really
annoying description field
WITH line breaks ...
EOF
So - I have been reading about "csplit" and it sounds like it might solve this issue, but I can't seem to get the regular expression right...
Basically I want the same output of roughly 50 files, just with every split falling right after a closing listing tag.
Something like:
csplit -k myfile.xml '/</listing>/' 25000 {50}
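To make the goal concrete, here is a rough Python sketch of the behaviour I'm after. It is not a csplit solution, just an illustration: it cuts a chunk only once it has at least 25,000 lines AND the current line ends with the closing listing tag. The function names (split_on_listing, write_chunks) and the chunk_ file prefix are made up for this example, and it assumes the closing tag is always the last thing on its line, as in the sample above.

```python
def split_on_listing(lines, min_lines=25000, close_tag="</listing>"):
    """Yield chunks (lists of lines), cutting only after a line that closes a listing.

    Hypothetical helper: assumes close_tag is the last thing on its line.
    """
    chunk = []
    for line in lines:
        chunk.append(line)
        # Cut only when the chunk is big enough AND this line ends a listing,
        # so a multi-line <desc> is never split across two chunks.
        if len(chunk) >= min_lines and line.rstrip().endswith(close_tag):
            yield chunk
            chunk = []
    if chunk:
        # Whatever is left over (e.g. the final </listings> line).
        yield chunk


def write_chunks(src_path, prefix="chunk_", min_lines=25000):
    """Stream src_path line by line and write chunk_000.xml, chunk_001.xml, ..."""
    with open(src_path, encoding="utf-8") as src:
        for i, chunk in enumerate(split_on_listing(src, min_lines)):
            with open(f"{prefix}{i:03d}.xml", "w", encoding="utf-8") as out:
                out.writelines(chunk)
```

It would be called as write_chunks("myfile.xml"). Like plain split, this leaves the XML header only in the first chunk and the closing listings tag only in the last one; the chunks may also run slightly over 25,000 lines, since the cut waits for the next listing to close.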
Any help would be great. Thanks!