ansaurus

Question

How to write a find-all function (with regex) in awk or sed

Answer 1

A:

I suggest you use grep -o.

-o, --only-matching
       Show only the part of a matching line that matches PATTERN.

E.g.:

$ cat > foo
test test test
test
bar
baz test
$ grep -o test foo
test
test
test
test
test
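
Since the question asks about regex matching, the same option also works with a pattern rather than a literal string. A minimal sketch, assuming a grep that supports -o (GNU and BSD grep do, but the option is not in POSIX); find_numbers is a hypothetical helper name and the input text is invented for illustration:

```shell
# Print every run of digits found on standard input, one match per line.
# find_numbers is a made-up name used only for this example.
find_numbers() {
    grep -o -E '[0-9]+'
}

# Sample input invented for illustration; the middle line has no match.
printf 'a1 b22\nnone\nc333 d4\n' | find_numbers
```

Lines without a match produce no output, and a line with several matches produces one output line per match.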

Update

If you were extracting href attributes from HTML files, using a command like:

$ grep -o -E 'href="([^"]*)"' /usr/share/vlc/http/index.html
href="style.css"
href="iehacks.css"
href="old/"

You could extract the values by using cut and sed like this:

$ grep -o -E 'href="([^"]*)"' /usr/share/vlc/http/index.html | cut -f2 -d'=' | sed -e 's/"//g'
style.css
iehacks.css
old/

But you'd be better off using an HTML/XML parser for reliability.

MattH 2010-09-14 10:03:34

It works fine, but when I use `grep -o -E 'href="([^"]*)"'` it returns the whole matched string, not the first group (the part in parentheses).

jcubic 2010-09-14 10:34:05

Yes, it will. You didn't mention that as a requirement. What are your requirements?

MattH 2010-09-14 11:00:00
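
To print only the parenthesized part, as asked in the comment above, one option is to let grep -o isolate each match and then strip the fixed prefix and suffix with sed. This is a sketch, assuming the attribute values are double-quoted and do not span lines; href_values is a hypothetical helper name and the sample HTML is invented:

```shell
# Print the value of every href="..." attribute read from standard input.
# href_values is a made-up name used only for this example.
href_values() {
    grep -o 'href="[^"]*"' |
        sed 's/^href="\(.*\)"$/\1/'   # keep only the text between the quotes
}

# Sample input invented for illustration.
printf '<a href="a.css">x</a> <a href="b/?id=1">y</a>\n' | href_values
```

Unlike cut -f2 -d'=', this does not truncate a value that itself contains an = character.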

Answer 2

A:

Here's a gawk implementation (not tested with other awks; the three-argument match() is a gawk extension), saved as find_all.sh:

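gawk -v "patt=$1" '
    # Print each match (a[0]) followed by its captured groups (a[1], a[2], ...).
    # a and i are declared as extra parameters to make them local.
    function find_all(str, patt,    a, i) {
        while (str != "" && match(str, patt, a) > 0) {
            for (i=0; i in a; i++) print a[i]
            # Advance at least one character so an empty match cannot loop forever.
            str = substr(str, RSTART + (RLENGTH > 0 ? RLENGTH : 1))
        }
    }
    $0 ~ patt {find_all($0, patt)}
' -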

Then:

echo 'asdf href="href1" asdf asdf href="href2" asdfasdf
asdfasdfasdfasdf href="href3" asdfasdfasdf' | 
find_all.sh 'href="([^"]+)"'

outputs:

href="href1"
href1
href="href2"
href2
href="href3"
href3

Change i=0 to i=1 if you only want to print the captured groups. With i=0 you'll get output even if you have no parentheses in your pattern.
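
For awks that lack gawk's three-argument match(), the whole-match part of this can be approximated with the standard two-argument match(), which sets RSTART and RLENGTH. This is only a sketch: it prints whole matches (no captured groups), it stops at the first empty match, and find_all_posix is a hypothetical name:

```shell
# Print every match of the extended regex $1 found on standard input.
# find_all_posix is a made-up name used only for this example.
find_all_posix() {
    awk -v "patt=$1" '
        {
            str = $0
            # Stop on an empty remainder or an empty match to avoid looping forever.
            while (str != "" && match(str, patt) > 0 && RLENGTH > 0) {
                print substr(str, RSTART, RLENGTH)
                str = substr(str, RSTART + RLENGTH)
            }
        }
    '
}

# Sample input invented for illustration.
printf 'x href="h1" y href="h2"\nz href="h3"\n' | find_all_posix 'href="[^"]+"'
```

Because awk processes backslash escapes in -v assignments, a pattern containing backslashes would likely need them doubled.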

glenn jackman 2010-09-17 19:19:28
