I see lots of examples and man pages on how to do things like search-and-replace using sed, awk, or gawk.

But in my case, I have a regular expression that I want to run against a text file to extract a specific value. I don't want to do search-and-replace. This is being called from bash. Let's use an example:

Example regular expression:

.*abc([0-9]+)xyz.*

Example input file:

a
b
c
abc12345xyz
a
b
c

As simple as this sounds, I cannot figure out how to call sed/awk/gawk correctly. What I was hoping to have in my bash script is:

myvalue=$( sed <...something...> input.txt )

Things I've tried include:

sed -e 's/.*([0-9]).*/\\1/g' example.txt # extracts the entire input file
sed -n 's/.*([0-9]).*/\\1/g' example.txt # extracts nothing
+2  A: 

I use perl to make this easier for myself. e.g.


perl -ne 'print $1 if /.*abc([0-9]+)xyz.*/'

This runs Perl. The -n option tells Perl to read its input one line at a time (from STDIN, or from any files named on the command line) and run the code against each line. The -e option supplies the code to run.

The code matches the regexp against the current line and, if it matches, prints the contents of the first set of parentheses ($1).

You can also put multiple file names on the end, e.g.

perl -ne 'print $1 if /.*abc([0-9]+)xyz.*/' example1.txt example2.txt

PP
Thanks, but we don't have access to perl, which is why I was asking about sed/awk/gawk.
Stéphane
+1  A: 

If you want to select lines then strip out the bits you don't want:

egrep 'abc[0-9]+xyz' inputFile | sed -e 's/^.*abc//' -e 's/xyz.*$//'

It basically selects the lines you want with egrep and then uses sed to strip off the bits before and after the number.

You can see this in action here:

pax> echo 'a
b
c
abc12345xyz
a
b
c' | egrep 'abc[0-9]+xyz' | sed -e 's/^.*abc//' -e 's/xyz.*$//'
12345
pax>

Update: obviously, if your actual situation is more complex, the REs will need to be modified. For example, if you always had a single number buried within zero or more non-numerics at the start and end:

egrep '^[^0-9]*[0-9]+[^0-9]*$' inputFile | sed -e 's/^[^0-9]*//' -e 's/[^0-9]*$//'
paxdiablo
Interesting... So there isn't a simple way to apply a complex regular expression and get back just what is in the (...) section? Cause while I see what you did here first with grep then with sed, our real situation is much more complex than dropping "abc" and "xyz". The regular expression is used because lots of different text can appear on either side of the text I'd like to extract.
Stéphane
I'm sure there *is* a better way if the REs are really complex. Perhaps if you provided a few more examples or a more detailed description, we could adjust our answers to suit.
paxdiablo
+3  A: 

My sed (Mac OS X) didn't work with +. I tried * instead and added the p flag to print the match:

sed -n 's/^.*abc\([0-9]*\)xyz.*$/\1/p' example.txt

For matching at least one numeric character without +, I would use:

sed -n 's/^.*abc\([0-9][0-9]*\)xyz.*$/\1/p' example.txt
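To get the result into a variable, as the question asks, the command above can be wrapped in a command substitution. This is only a minimal sketch: the temporary file and the variable names `input` and `myvalue` are illustrative, and the sample data is the input from the question.

```shell
# Sample input from the question, written to a temporary file.
input=$(mktemp)
printf '%s\n' a b c abc12345xyz a b c > "$input"

# -n suppresses sed's default output; the trailing p prints only the lines
# where the substitution succeeded, so non-matching lines produce nothing.
myvalue=$(sed -n 's/^.*abc\([0-9][0-9]*\)xyz.*$/\1/p' "$input")

echo "$myvalue"   # prints 12345
rm -f "$input"
```

If more than one line matches, `myvalue` will hold all of the extracted values, one per line.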
mouviciel
Thank you, this worked for me as well once I used * instead of +.
Stéphane
...and the "p" option to print the match, which I didn't know about either. Thanks again.
Stéphane
I had to escape the `+` and then it worked for me: `sed -n 's/^.*abc\([0-9]\+\)xyz.*$/\1/p'`
Dennis Williamson
A: 

For awk, I would use the following script:

/.*abc([0-9]+)xyz.*/ {
      print $0;
      next;
      }
      {
      # default, do nothing
      }
Pierre
which gets grep like behavior...
dmckee
A: 
gawk '/.*abc([0-9]+)xyz.*/' file
ghostdog74
This doesn't seem to work. It prints the entire line instead of the match.
Stéphane
In your sample input file, that pattern is the whole line, right? If you know the pattern is going to be in a specific field, use $1, $2, etc., e.g. gawk '$1 ~ /.*abc([0-9]+)xyz.*/' file
ghostdog74
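Since both awk answers print the whole line, here is a hedged sketch of one way to print only the digits with plain awk. It relies on the standard match() function, which sets RSTART and RLENGTH, and it assumes the fixed three-character prefix "abc" and suffix "xyz" from the question's example; with variable-length surroundings the offsets would have to be computed differently.

```shell
# match() sets RSTART (1-based start of the match) and RLENGTH (its length).
# Skip the 3-character prefix "abc" and drop the 3-character suffix "xyz".
myvalue=$(printf '%s\n' a b c abc12345xyz a b c |
  awk 'match($0, /abc[0-9]+xyz/) { print substr($0, RSTART + 3, RLENGTH - 6) }')

echo "$myvalue"   # prints 12345
```

gawk additionally accepts an array as a third argument to match() and stores the parenthesized groups in it, but that is a gawk extension and will not work in other awks.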
A: 

If your version of grep supports it you could use the -o option to print only the portion of any line that matches your regexp.

If not then here's the best sed I could come up with:

sed -e '/[0-9]/!d' -e 's/^[^0-9]*//' -e 's/[^0-9]*$//'

... which deletes/skips lines with no digits and, for the remaining lines, removes all leading and trailing non-digit characters. (I'm only guessing that your intention is to extract the number from each line that contains one).

The problem with something like:

sed -e 's/.*\([0-9]*\).*/&/'

.... or

sed -e 's/.*\([0-9]*\).*/\1/'

... is that sed only supports "greedy" matching, so the first .* will match the rest of the line. Unless we can use a negated character class to achieve a non-greedy match, or a version of sed with Perl-compatible or other extensions to its regexes, we can't extract a precise pattern match from within the pattern space (a line).
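The greedy behaviour is easy to see on the sample line from the question. The sketch below is illustrative only (the variable names are made up); it compares a leading .* with a negated character class:

```shell
line='abc12345xyz'

# [0-9]* may match zero digits, so the greedy leading .* swallows the whole
# line and the captured group is empty.
empty=$(printf '%s\n' "$line" | sed -e 's/.*\([0-9]*\).*/\1/')

# Requiring at least one digit still leaves only the last digit for the group,
# because the leading .* takes as many characters as it can.
last=$(printf '%s\n' "$line" | sed -e 's/.*\([0-9][0-9]*\).*/\1/')

# Replacing the leading .* with [^0-9]* makes it stop at the first digit.
all=$(printf '%s\n' "$line" | sed -e 's/[^0-9]*\([0-9][0-9]*\).*/\1/')

printf 'empty=[%s] last=[%s] all=[%s]\n' "$empty" "$last" "$all"
# prints: empty=[] last=[5] all=[12345]
```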

Jim Dennis
You can just combine two of your `sed` commands in this way: `sed -n 's/[^0-9]*\([0-9]\+\).*/\1/p'`
Dennis Williamson
Previously didn't know about -o option on grep. Nice to know. But it prints the entire match, not the "(...)". So if you are matching on "abc([[:digit:]]+)xyz" then you get the "abc" and "xyz" as well as the digits.
Stéphane
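One possible workaround for the point raised in the comment above is to chain two grep -o calls: the first keeps the full match, the second keeps only the digits inside it. This is a sketch that assumes your grep supports -o (it is an extension, not something every grep has) and that the text around the number contains no digits of its own.

```shell
# First pass keeps only the text matching the full pattern (abc12345xyz);
# second pass keeps only the run of digits inside that match.
myvalue=$(printf '%s\n' a b c abc12345xyz a b c |
  grep -oE 'abc[0-9]+xyz' | grep -oE '[0-9]+')

echo "$myvalue"   # prints 12345
```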
A: 

You can do it with the shell:

while IFS= read -r line
do
    case "$line" in
        *abc*[0-9]*xyz* )
            t="${line##*abc}"            # drop everything through the last "abc"
            echo "num is ${t%%xyz*}";;   # drop the first "xyz" and all that follows
    esac
done <"file"
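Because the script is bash, another option is bash's own regex operator, which gives direct access to the (...) group. This is a sketch that needs bash itself (it will not run under a plain POSIX sh); the names `re` and `myvalue` are illustrative, and the here-document stands in for the real input file.

```shell
# Requires bash: [[ string =~ regex ]] stores the groups in BASH_REMATCH.
re='abc([0-9]+)xyz'
myvalue=''

while IFS= read -r line; do
    if [[ $line =~ $re ]]; then
        myvalue=${BASH_REMATCH[1]}   # text captured by the first (...) group
        break                        # stop at the first matching line
    fi
done <<'EOF'
a
b
c
abc12345xyz
a
b
c
EOF

echo "$myvalue"   # prints 12345
```

Keeping the pattern in a variable and expanding it unquoted on the right-hand side of =~ avoids quoting surprises, since a quoted pattern is matched as a literal string.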