tags:

views: 177

answers: 1

I have a giant text file (about 1.5 gigabytes) with XML data in it. All the text in the file is on a single line, and attempting to open it in any text editor (even the ones mentioned in this thread: http://stackoverflow.com/questions/159521/text-editor-to-open-big-giant-huge-large-text-files ) either fails horribly or is totally unusable because the editor hangs when attempting to scroll.

I was hoping to introduce newlines into the file by using the following sed command

sed 's/>/>\n/g' data.xml > data_with_newlines.xml

Sadly, this caused sed to give me a segmentation fault. From what I understand, sed reads the file line by line, which in this case means it attempts to read the entire 1.5 GB file as a single line, which would most certainly explain the segfault. However, the problem remains.

How do I introduce newlines after each > in the XML file? Do I have to resort to writing a small program to do this for me by reading the file character by character?

+2  A: 

Some sed implementations have a limit on line length. GNU sed has no built-in limit: as long as it can `malloc()` more (virtual) memory, you can feed or construct lines as long as you like (from the GNU sed documentation).

I would suggest, if possible, changing how you create that XML file (why is it all on one line in the first place?). Otherwise, you could read it one character at a time, e.g. using the shell:

# IFS= and -r keep whitespace and backslashes intact while reading one character at a time
while IFS= read -r -n 1 ch
do
  case "$ch" in
    ">") printf "%s\n" "$ch";;
      *) printf "%s" "$ch";;
  esac
done < "file"

or

# read 1000 characters at a time; '|| [ -n "$str" ]' keeps the last partial chunk,
# and printf avoids the extra newline echo would add after every chunk
while IFS= read -r -n 1000 str || [ -n "$str" ] ; do
 printf '%s' "${str//>/>
}"
done < file
ghostdog74
Good one, but it can be optimized: `while read -n 1000 str ; do echo -n "$str" | sed 's/>/>\n/g' ; done < file`
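
For readability, the same command can be spread over several lines (nothing changed except the layout):

while read -n 1000 str ; do
  # pipe each 1000-character chunk through sed, which appends a newline after every >
  echo -n "$str" | sed 's/>/>\n/g'
done < file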
Chen Levy
Oh believe me, I've asked myself several times why it's all in one line in the first place (often followed by some very creative cursing) :) Sadly it's not something I can do anything about. The reading one character at a time idea seems to work pretty well though. I'd hoped to not have to do that, but it works. Thanks!
wasatz
@Chen, I would cut the use of `sed` and just use internal shell substitution.
ghostdog74
@ghostdog74, I thought your suggestion was cool, but I timed `while read -n 1000 str ; do echo -ne "${str//>/>\n}" ; done < file` and it turned out to be significantly slower; even slower than the one-character-at-a-time code above.
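
A rough way to reproduce this kind of comparison, assuming a test copy of the file named data.xml (the name used in the question), is bash's `time` keyword; both variants send their output to /dev/null so only the transformation itself is measured:

# hypothetical benchmark: shell substitution vs. piping each chunk through sed
nl=$'\n'
time { while IFS= read -r -n 1000 str; do printf '%s' "${str//>/>$nl}"; done < data.xml > /dev/null; }
time { while IFS= read -r -n 1000 str; do printf '%s' "$str" | sed 's/>/>\n/g'; done < data.xml > /dev/null; }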
Chen Levy
I believe not. Calling an external command is definitely slower than calling a built-in.
ghostdog74
Your theory is sound, but I *did* test it. My guess is that the `sed` implementation is significantly more efficient than the bash parameter expansion.
Chen Levy
If you are talking about just using sed, then yes, but I am talking about `while loop + sed` vs. `while loop + internals`. In that case, `while loop + sed` is definitely slower.
ghostdog74