Hi

I'm writing a shell script that, at some point, has to take a file, search it for a particular word, and delete all the text that comes after this word (including the word itself). I suppose awk is the right tool, but I don't really know much about programming in it.

Could anyone help me?

+6  A: 

I suppose 'awk' is one tool for the job, though I think 'sed' is simpler for this particular operation. The specification is a bit vague. The simple version is:

  • Find the first line containing a given word.
  • Delete that line and all following lines.

For that, I'd use 'sed':

sed '/word/,$d' file
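
As a quick check of the simple version, here is a minimal sketch. The sample lines and the search word 'stop' are invented for illustration and are not from the question.

```shell
# Hypothetical sample data; the contents and the word "stop" are assumptions.
sample=$(mktemp)
printf '%s\n' 'alpha' 'beta' 'please stop here' 'gamma' > "$sample"

# Delete from the first line matching /stop/ through the last line ($).
result=$(sed '/stop/,$d' "$sample")
printf '%s\n' "$result"
# prints:
# alpha
# beta
```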

The more complex version is:

  • Find the first line containing a given word.
  • Delete the text on that line from the word onwards.
  • Delete all subsequent lines of text.

I'd probably still use 'sed':

sed -n '1,/word/{s/word.*//;p;}' file

This inverts the logic. Nothing is printed by default, but for every line from line 1 through the first line containing word, it performs a substitution (which does nothing until the line containing the word) and then prints the line.
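
One edge case worth knowing about: with a '1,/word/' range, sed only starts looking for the end of the range on line 2, so a match on the very first line does not close the range. A minimal sketch of an alternative tests each line directly and quits at the first match. The function name cut_at_first is hypothetical, and the word is assumed to contain nothing special to sed's regex syntax or the '/' delimiter.

```shell
# cut_at_first is a hypothetical helper; $1 = word, $2 = file.
cut_at_first() {
  # On the first matching line: strip from the word onwards, print, quit.
  # Every earlier line falls through to the final p and is printed as-is.
  sed -n -e "/$1/{" -e "s/$1.*//" -e p -e q -e '}' -e p "$2"
}

# Assumed sample input with the word on line 1.
demo=$(mktemp)
printf '%s\n' 'stop right away' 'second line' > "$demo"
cut_at_first stop "$demo"    # prints one empty line; "second line" is dropped
```

Splitting the script across several -e options avoids relying on how a particular sed treats ';' around braces.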

Can it be done in 'awk'? Not completely trivially because 'awk' autosplits input lines into words, and because you have to use functions to do substitutions.

awk '/word/ { if (found == 0) {
                # First line with word
                sub("word.*", "")
                print $0;
                found = 1
              }
            }
            { if (found == 0) print $0; }' file

(Edited: change 'delete' to 'found' since 'delete' is a reserved word in 'awk'.)

In all these examples, the truncated version of the input file is written to standard output. To modify the file in situ, you either need to use Perl or Python or a similar language, or you capture the output in a temporary file which you copy over the original once the command has completed. (If you try 'script file > file', the shell truncates the file before the script reads it, so you process an empty file.)
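
The temporary-file approach could be sketched roughly as follows. The function name truncate_in_place is made up for this example, mktemp is assumed to be available, and the word is assumed to contain no characters special to sed.

```shell
# truncate_in_place is a hypothetical helper; $1 = word, $2 = file.
truncate_in_place() {
  tmp=$(mktemp) || return 1
  # Write the truncated text to the temporary file first...
  if sed "/$1/,\$d" "$2" > "$tmp"; then
    # ...then copy it over the original once sed has finished reading it.
    cat "$tmp" > "$2"
  fi
  rm -f "$tmp"
}

# Assumed sample file for illustration.
target=$(mktemp)
printf '%s\n' 'alpha' 'beta' 'please stop here' 'gamma' > "$target"
truncate_in_place stop "$target"
cat "$target"
# prints:
# alpha
# beta
```

Copying with cat rather than moving the temporary file keeps the original file's permissions and hard links intact.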

There are various early exit optimizations that could be applied to the sed and awk scripts, such as:

sed '/word/q' file
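
Note that quitting with q still prints the matching line before sed exits, so this is not a drop-in replacement for the first command. A small sketch of the difference, using invented sample data and the word 'stop' as assumptions:

```shell
# Hypothetical sample data for illustration.
sample=$(mktemp)
printf '%s\n' 'alpha' 'beta' 'please stop here' 'gamma' > "$sample"

# Quits at the first match, but the matching line is still auto-printed.
with_match=$(sed '/stop/q' "$sample")

# With -n, quit before anything is printed for the matching line;
# earlier lines are printed explicitly by p.
without_match=$(sed -n -e '/stop/q' -e p "$sample")

printf '%s\n' "$with_match"      # alpha, beta, please stop here
printf '%s\n' "$without_match"   # alpha, beta
```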

And, if you assume the use of the GNU versions of awk or sed, there are various non-standard extensions that can help with in-situ modification of the file.

Jonathan Leffler
Agreed, I'd probably do this in sed, too.
Stobor
sed -e ' /\<word\>.*/{s///; q}' does the same thing, and only specifies the word once. (I had a previous comment purporting to do the same thing, but the match was wrong...) Also, you probably want to specify \<word\> to avoid getting caught on someone's swords.
Stobor
@Stobor: well, of course, we're getting into interesting territory with the definition of words, and also the definition of which regex syntax the version of 'sed' supports. The '\<word\>' notation is excellent when supported; it is not supported traditionally, though I find that it is supported on Solaris (somewhat to my surprise).
Jonathan Leffler
@Stobor: avoiding repeating the word is advantageous. Also, similar comments about the definition of 'word' in 'awk' could apply. All these are refinements on the basic technique - exercises for the reader, if you will (or original poster).
Jonathan Leffler
A: 

I'm assuming your input is something like this:

Lorem ipsum dolor sit amet,
consectetur adipiscing velit.
Nullam neque sapien, molestie vel congue non,
feugiat quis tellus. Ut quis
nulla mi. Maecenas a ligula.

and you want the output to be cut off at the word 'vel' like so:

Lorem ipsum dolor sit amet,
consectetur adipiscing velit.
Nullam neque sapien, molestie

In that case, your awk script would be:

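cat lorem.txt | awk '
  # \< and \> are GNU awk word-boundary operators.
  /\<vel\>/ {
     # Print the text before the word, then stop reading input.
     print substr($0, 1, match($0, /\<vel\>/) - 1)
     exit
  }

  { print }
'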

The word you want to cut off at needs to replace both instances of the word vel in the script.

You can safely put the entire script on one line, too.
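
As far as I know, the \< and \> operators are a GNU awk extension, so other awk implementations may not accept them. A rough sketch of a workaround pads each line with spaces and looks for the word between non-word characters. The function name cut_at_whole_word is hypothetical, and the word is assumed to consist of plain letters, digits or underscores.

```shell
# cut_at_whole_word is a hypothetical helper; $1 = word, $2 = file.
cut_at_whole_word() {
  awk -v w="$1" '
    {
      # Pad the line so a word at either end still has a neighbour to test.
      padded = " " $0 " "
      if (match(padded, "[^A-Za-z0-9_]" w "[^A-Za-z0-9_]")) {
        # The match starts one character before the word in padded,
        # which is the position of the word itself in $0.
        print substr($0, 1, RSTART - 1)
        exit
      }
      print
    }
  ' "$2"
}

# The sample text from above, written to a temporary file.
lorem=$(mktemp)
printf '%s\n' 'Lorem ipsum dolor sit amet,' \
  'consectetur adipiscing velit.' \
  'Nullam neque sapien, molestie vel congue non,' \
  'feugiat quis tellus. Ut quis' \
  'nulla mi. Maecenas a ligula.' > "$lorem"
cut_at_whole_word vel "$lorem"
```

The word 'velit' on the second line is left alone because the character after 'vel' there is a letter.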

Stobor
@Stobor useless use of cat, it's never a good idea.
Erik
@Erik: I'm not going to get into the argument over the word "never"... Suffice to say I agree that it's not useful here.
Stobor
A: 

I'm not sure how to do it with awk, but you could do it with sed:

sed -i~ -e 's/the-word-to-find.*$//' the-file

This will delete everything from the-word-to-find to the end of the line, on every line that contains the-word-to-find. If you want to delete the rest of the file upon the first occurrence of the-word-to-find, you could do:

sed -i~ -e 's/\(the-word-to-find\).*$/\1/;/the-word-to-find/,$d' the-file
Adam Rosenfield
The second one worked perfectly - thanks a lot :)
A: 

This awk one-liner should do the trick:

{ sub(/ word.*/, ""); print }

For every line, if the line contains a pattern that starts with word (preceded by a space) and runs to the end of the line, replace the pattern with the empty string, then print the updated line.

[ Figured the question could read either way (whole text on that line or whole text in the file). If one wanted to skip the rest of the file one could: { skip = gsub(/ word.*/, ""); print ; if (skip) exit } ]

dhn
I don't think that addresses the question - it doesn't ignore the remainder of the file after the first occurrence of the searched-for word.
Jonathan Leffler
A: 
awk '/word/{exit}1' file
ghostdog74