Hi all,

I'm trying to join sentences in a document, but some of the sentences have been split apart with an empty line in between. For example:

The dog chased after a ball

that was thrown by its owner.

The ball travelled quite far.

should become:

The dog chased after a ball that was thrown by its owner.

The ball travelled quite far.

My idea was to search for an empty line followed by a line that begins with a lowercase character, then remove the empty line and append that lowercase line to the broken sentence above it (sorry for the confusion).
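
A minimal Python sketch of the idea described above, for illustration only. The function name `join_broken_sentences` is hypothetical, and it assumes the text uses plain `\n` line endings. Note that `str.islower()` also accepts non-ASCII lowercase letters, which is slightly broader than `[a-z]`.

```python
def join_broken_sentences(text):
    """Join a lowercase-initial line onto the line above it when the
    two are separated by exactly one empty line (hypothetical helper)."""
    lines = text.split("\n")
    out = []
    i = 0
    while i < len(lines):
        line = lines[i]
        next_is_lower = i + 1 < len(lines) and lines[i + 1][:1].islower()
        if line == "" and out and out[-1] != "" and next_is_lower:
            # Drop the empty line and append the continuation to the previous line.
            out[-1] = out[-1] + " " + lines[i + 1]
            i += 2
        else:
            out.append(line)
            i += 1
    return "\n".join(out)


sample = (
    "The dog chased after a ball\n"
    "\n"
    "that was thrown by its owner.\n"
    "\n"
    "The ball travelled quite far."
)
print(join_broken_sentences(sample))
```

The empty line before "The ball travelled quite far." is kept because the line after it starts with an uppercase letter.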

I'm new to sed and tried it with this command:

sed "/$/{:a;N;s/\n\(^[a-z]* .*\)/ \1/;ba}"

But it only does this once, and it only removes the empty line; it does not append the second half of the broken sentence to the first part.

Please help.

+1  A: 

Hi,

This is the first time I have used sed for such intricate replacements. It took me around 2 hours to come up with something :D

I used GNU sed, as I wasn't able to get branching to work on a single line on my Mac.

Here is the input content I used for testing:

The dog chased after a ball

that was thrown by its owner.

The ball

travelled quite far.
I took me a while to fix this file.
And now it's

working :)

Then here is the sed command line I came up with:

$ sed -n '/^$/!bstore;/^$/N;s/\n\([a-z]\)/ \1/;tmerge;h;d;:store;H;b;:merge;H;g;s/\n \([a-z]\)/ \1/;p;s/.*//g;h;d' sentences.txt

And here is the output:

$ sed -n '/^$/!bstore;/^$/N;s/\n\([a-z]\)/ \1/;tmerge;h;d;:store;H;b;:merge;H;g;s/\n \([a-z]\)/ \1/;p;s/.*//g;h;d' sentences.txt

The dog chased after a ball that was thrown by its owner.

The ball travelled quite far.

I took me a while to fix this file.
And now it's working :)

You'll notice that an empty line is inserted right at the beginning, but I think one can live with that. Please comment on this one if you have mastered sed, as this is just a novice attempt.

Gregory Pakosz
On your Mac, you might try breaking the `sed` script into multiple `-e` parts. Some versions of `sed` require that.
Dennis Williamson
+2  A: 

This should do the trick:

sed ':a;$!{N;N};s/\n\n\([a-z]\)/ \1/;ta;P;D' sentences
Dennis Williamson
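
For readers without sed at hand, here is a rough Python sketch of the same substitution, `s/\n\n\([a-z]\)/ \1/`, applied to the whole text at once with `re.sub`. The function name `join_split_lines` is hypothetical, and the sketch assumes `\n` line endings; it is not claimed to match the sed output byte for byte.

```python
import re


def join_split_lines(text):
    """Replace an empty line followed by a lowercase letter with one space
    (hypothetical helper mirroring the sed substitution)."""
    return re.sub(r"\n\n([a-z])", r" \1", text)


sample = (
    "The dog chased after a ball\n"
    "\n"
    "that was thrown by its owner.\n"
    "\n"
    "The ball\n"
    "\n"
    "travelled quite far.\n"
    "I took me a while to fix this file.\n"
    "And now it's\n"
    "\n"
    "working :)\n"
)
print(join_split_lines(sample), end="")
```

Because the pattern requires two consecutive newlines, lines that are merely adjacent (no empty line in between) are left alone.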
+1, looks much better, I thought I couldn't do without the hold space -- at least I tried :)
Gregory Pakosz
Thanks very much everyone! :) I was testing out on a gedit plain text document that has some text in it and for some reason it did not work, but the example I gave earlier with the dog sentence did. The reason for this was that some of the lines had \r\n (carriage return + new line) hidden. I just had to remove the \r and everything worked out.
dmsu
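
The carriage-return problem mentioned in the comment above can be illustrated with a small Python sketch. The names `normalize_newlines` and `join_split` are hypothetical. With `\r\n` line endings the "empty" line actually contains a `\r`, so a pattern that looks for two consecutive `\n` characters never matches until the `\r` is removed.

```python
import re


def normalize_newlines(text):
    """Convert Windows-style \r\n line endings to \n (hypothetical helper)."""
    return text.replace("\r\n", "\n")


def join_split(text):
    """Join a lowercase-initial line onto the previous one across an empty line."""
    return re.sub(r"\n\n([a-z])", r" \1", text)


raw = "The dog chased after a ball\r\n\r\nthat was thrown by its owner.\r\n"

# Without normalization nothing is joined: the text contains "\n\r\n", never "\n\n".
print(repr(join_split(raw)))

# After stripping the carriage returns the substitution applies.
print(repr(join_split(normalize_newlines(raw))))
```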
+1  A: 

If you have Python, you can try this snippet:

import sys

# Read every line, keeping the trailing newlines.
with open("file") as fh:
    data = fh.readlines()

found_upper = None  # index of the latest line that starts with an uppercase letter
for n, line in enumerate(data):
    if line[:1].isupper():
        found_upper = n
    elif found_upper is not None and line[:1].islower():
        # Continuation line: append it to that line and blank out its own slot.
        data[found_upper] = data[found_upper].strip() + " " + line
        data[n] = ""

sys.stdout.write("".join(data))

Output (I added more scenarios to your file format):

$ cat file
the start
THE START
The dog chased after a ball
that was thrown by its owner.

My ball travelled quite far




and it smashed the windows
but it didn't cause much damage


THE END
THE FINAL DESTINATION
final
FINAL DESTINATION LAST EPISODE
the final final

$ ./python.py
the start
THE START
The dog chased after a ball that was thrown by its owner.

My ball travelled quite far and it smashed the windows but it didn't cause much damage






THE END
THE FINAL DESTINATION final
FINAL DESTINATION LAST EPISODE the final final
ghostdog74