This seems like it should be dirt simple, but the awk gensub/gsub/sub behavior has always been unclear to me, and now I just can't get it to do what the documentation says it should do (and what experience with a zillion other similar tools suggests should work). Specifically, I want to access "captured groups" from a regex in the replacement string. Here's what I think the awk syntax should be:

awk '{ gsub(/a(b*)c/, "Here are bees: \1"); print; }'

That should turn "abbbc" into "Here are bees: bbb". It does not, at least not for me in Ubuntu 9.04. Instead, the "\1" is rendered as a ^A; that is, the character with code 1. Not what I want, of course. How do I do this?

Thanks.

A: 

echo abbc | awk '{ print gensub(/a(b*)c/, "Here are bees: \\1", "g", $1);}'

See the gawk manual for the difference between gsub and gensub.

Edit: something is wrong with the editor here; it transforms \\ to \ ?!
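For what it's worth, the ^A in the question is most likely down to how awk reads string literals, not to the substitution functions themselves: inside double quotes, "\1" is parsed as an octal escape for the single byte with value 1, while "\\1" is a literal backslash followed by the digit 1. A quick sketch to check this (piping through od is only my choice for making the bytes visible):

```shell
# "\1" in an awk string literal is an octal escape: one byte with value 1.
awk 'BEGIN { printf "%s", "\1" }' | od -An -tu1

# "\\1" is a backslash followed by the digit 1: two bytes, 92 and 49.
awk 'BEGIN { printf "%s", "\\1" }' | od -An -tu1
```

So the replacement string only contains a backslash-digit pair, which is what gensub looks for, when the backslash is doubled.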

RC
That doesn't work!
Pointy
Also, this issue has nothing to do with the difference between gsub and gensub - both routines (and sub also) behave exactly the same way.
Pointy
Oops, sorry, you need to double the \ in \1: `echo abbc | awk '{ print gensub(/a(b*)c/, "Here are bees: \\1", "g", $1);}'`
RC
Ah wait - yes, it does have to do with the difference between gensub and gsub. Looks like gawk is pretty broken; no wonder I've been confused all these years. The example given in the answer above is still wrong: the backslash needs to be doubled in order to make it actually work.
Pointy
And gsub and gensub don't behave the same: gsub returns the number of substitutions and gensub returns the substituted string.
RC
OK yes, @RC, thanks - sorry; we collided in the comment.
Pointy
Also, not only do gsub and gensub behave differently with respect to return value, but the whole \1 through \9 feature *only* works with gensub.
Pointy
A: 

Per the gawk manual:

gensub provides an additional feature that is not available in sub or gsub: the ability to specify components of a regexp in the replacement text. This is done by using parentheses in the regexp to mark the components and then specifying ‘\N’ in the replacement text, where N is a digit from 1 to 9.

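You must use gensub, you must pass its third argument ("g" to replace every match, or a number to replace only that occurrence), and you must grab the result of gensub, since it returns the new string and does not modify its target in-place.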

awk '{ r = gensub(/a(b*)c/, "Here are bees: \\1", "g"); print r; }'
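To make the in-place versus returned-string difference concrete, here is a small side-by-side sketch. It assumes gawk is installed and calls it as `gawk` explicitly, since `awk` on some systems is a different implementation without gensub; the "[&]" replacement is just an illustrative choice (& stands for the whole match, which gsub does support).

```shell
# gsub edits $0 in place and returns the number of substitutions made.
echo abbbc | gawk '{ n = gsub(/a(b*)c/, "[&]"); print n, $0 }'
# prints: 1 [abbbc]

# gensub leaves $0 untouched and returns the new string; \\1 is group 1.
echo abbbc | gawk '{ r = gensub(/a(b*)c/, "Here are bees: \\1", "g"); print r; print $0 }'
# prints: Here are bees: bbb
#         abbbc
```

If you want the record itself changed, assign the result back, e.g. `$0 = gensub(...)`.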
Jonathan Feinberg