I have a large text file I would like to put on my ebook reader, but the formatting comes out all wrong: every line is hard-wrapped at or before column 80 with CR/LF, and paragraphs and headers are not marked any differently; they too end with just a single CR/LF.

What I would like is to replace all CR/LF's after column 75 with a space. That would make most paragraphs continuous. (Not a perfect solution, but a lot better to read.)

Is it possible to do this with a regex? Preferably a (linux) perl or sed oneliner, alternatively a Notepad++ regex.

+1  A: 

This seems to get pretty close:

sed '/^$/! {:a;N;s/\(.\{75\}[^\n]*\)\n\(.\{75\}\)/\1 \2/;ta}' ebook.txt

It doesn't get the last line of a paragraph if it's shorter than 75 characters.

Edit:

This version should do it all:

sed '/^.\{0,74\}$/ b; :a;N;s/\(.\{75\}[^\n]*\)\n\(.\{75\}\)/\1 \2/;ta; s/\n/ /g' ebook.txt

Edit 2:

If you want to re-wrap at word/sentence boundaries at a different width (here 65, but choose any value) to prevent words from being broken at the margin (or long lines from being truncated):

sed 's/^.\{0,74\}$/&\n/' ebook.txt | fmt -w 65 | sed '/^$/{N;s/\n//}'
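The pipeline is easier to follow when split into its three stages. This is only a sketch: it assumes GNU sed (for \n in the replacement), fmt from coreutils, and input that already has Unix line endings. The function name rewrap is made up for the example, and the last stage is written as /^$/{N;s/\n//}.

```shell
# rewrap: hypothetical wrapper around the three-stage pipeline.
# Reads the named files, or stdin when called without arguments.
rewrap() {
    # Stage 1: add an empty line after every line of 74 characters or
    # fewer, so fmt treats that line as the end of a paragraph.
    sed 's/^.\{0,74\}$/&\n/' "$@" |
    # Stage 2: refill each paragraph to at most 65 columns, breaking
    # only between words.
    fmt -w 65 |
    # Stage 3: remove the empty lines added in stage 1. On an empty line,
    # append the next line and delete the newline between the two.
    sed '/^$/{N;s/\n//}'
}
```

Usage would be something like rewrap ebook.txt > ebook-65.txt.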

To change from DOS to Unix line endings, just add dos2unix to the beginning of any of the pipes above:

dos2unix < ebook.txt | sed '/^.\{0,74\}$/ b; :a;N;s/\(.\{75\}[^\n]*\)\n\(.\{75\}\)/\1 \2/;ta; s/\n/ /g'
Dennis Williamson
Working fine, but compared to the perl solution, didn't remove the DOS line endings (which I of course can remove with 'tr'), and it took a very long time, 10.2 seconds compared to 0.08 for perl.
Olav
+2  A: 
perl -p -e 's/\s+$//; $_ .= length() <= 75 ? qq{\n} : q{ }' book.txt

Perl's -p option means: for each input line, process and print. The processing code is supplied with the -e option. In this case: remove trailing whitespace and then attach either a newline or a space, depending on line length.

FM
Excellent! Both quick, working very well and understandable.
Olav
+1  A: 

Not really answering your question, but you can achieve this result in vim using this global join command. The v in \%>74v makes vim measure line length in virtual (screen) columns, so tabs count as the whitespace they expand to, a feature that might be useful depending on your source text.

:g/\%>74v$\n/j
blissapp
A: 

The less fancy option would be to replace the cr/lf's that appear by themselves on a line with a single lf or cr, then remove all the remaining cr/lf's. No need for fancy/complicated stuff.

regex 1: ^\r\n$ finds lone cr/lf's. It is then trivial to replace the remaining ones. See this question for help finding cr/lf's in np++.

zdav
Ah, but there are almost no CR/LF's by themselves. Many paragraphs are just short lines, where I want to keep the EOL. I chose column 75 because that will catch most multi-line wrapped paragraphs. I'll probably have to adjust the column number from file to file to get the optimal result.
Olav