How do I extract the text between two strings that follow a very specific pattern, from a file full of lines like this one? Example:

input:a_log.gz:make=BMW&year=2000&owner=Peter

I essentially want to capture the part make=BMW&year=2000. I know for a fact that each line starts with "input:(any number of characters).gz:" and ends with "owner=Peter".

A: 

Use the regex input:.*?\.gz:(.*?)&?owner=Peter (without any trailing period). The capture group will contain everything between the second colon and "owner=Peter", with the trailing ampersand trimmed.
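As a rough sketch of how this pattern might be applied with `sed`: the lazy `?` quantifiers are dropped, since `sed` does not support them, and the match is anchored at both ends instead. This assumes a `sed` that accepts -E (extended regular expressions, so the parentheses need no backslashes). The function name extract_params_sed is made up for illustration.

```shell
# Hypothetical helper: reads lines on stdin and prints the captured part.
# -n suppresses default output; the trailing p prints only lines that matched.
extract_params_sed() {
  sed -nE 's/^input:.*\.gz:(.*)&owner=Peter$/\1/p'
}

# Usage with the sample line from the question plus one non-matching line.
printf '%s\n' \
  'input:a_log.gz:make=BMW&year=2000&owner=Peter' \
  'some unrelated line' | extract_params_sed
# prints: make=BMW&year=2000
```

For a real file, the same function could be fed with `extract_params_sed < file`.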

Borealid
What utility would I use with the regex? Just grep?
syker
Borealid
@Borealid, that's not working for me. Fails on "\1"
syker
Also what is with the question marks?
syker
In order to make it work you need backslashes on the parentheses.
syker
@syker: The question marks are to make the pattern not greedy, but that doesn't work with `sed`.
Dennis Williamson
A: 

Give this a try:

sed -n 's/.*:\([^&]*&[^&]*\)&.*/\1/p' file

This will extract everything between the second colon and the second ampersand regardless of what's before and after (if there are more colons or ampersands it may not work properly).
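To make the caveat concrete, here is a small sketch that runs the same `sed` command on three inputs. The helper name first_two_fields and the extra "color" field are invented for the demonstration and are not part of the original data.

```shell
# The sed command from above, wrapped in a hypothetical helper that reads stdin.
first_two_fields() {
  sed -n 's/.*:\([^&]*&[^&]*\)&.*/\1/p'
}

# Two fields before owner=...: both are captured.
two=$(printf '%s\n' 'input:a_log.gz:make=BMW&year=2000&owner=Peter' | first_two_fields)

# Three fields: only the first two are kept, the third is silently dropped.
three=$(printf '%s\n' 'input:a_log.gz:make=BMW&year=2000&color=red&owner=Peter' | first_two_fields)

# One field: the pattern needs two ampersands after the colon, so nothing is printed.
one=$(printf '%s\n' 'input:a_log.gz:make=BMW&owner=Peter' | first_two_fields)

printf 'two=%s\nthree=%s\none=%s\n' "$two" "$three" "$one"
```

So the command is a good fit only when every line carries exactly two fields before the owner field.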

Dennis Williamson
A: 

You can use the shell (bash/ksh). Including the ampersand in the suffix pattern keeps it out of the result:

$ s="input:a_log.gz:make=BMW&year=2000&owner=Peter"
$ s=${s##*gz:}
$ echo "${s%%&owner=Peter*}"
make=BMW&year=2000

If you want sed:

$ echo "$s" | sed 's/input.*gz://;s/&owner=Peter//'
make=BMW&year=2000
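To run the same parameter expansions over a whole file, a loop along these lines might work. It is only a sketch: extract_params_sh is a hypothetical name, lines that do not look like input:...gz:...&owner=Peter are skipped, and the ampersand is stripped together with owner=Peter.

```shell
# Hypothetical helper: reads lines on stdin, prints the part between
# ".gz:" and "&owner=Peter" for every line that has the expected shape.
extract_params_sh() {
  while IFS= read -r line; do
    case $line in
      input:*.gz:*'&owner=Peter') ;;   # expected shape, fall through
      *) continue ;;                   # anything else is ignored
    esac
    rest=${line##*.gz:}                  # drop everything up to the last ".gz:"
    printf '%s\n' "${rest%&owner=Peter}" # drop the trailing "&owner=Peter"
  done
}

# Usage with the sample line from the question plus one non-matching line.
printf '%s\n' \
  'input:a_log.gz:make=BMW&year=2000&owner=Peter' \
  'some unrelated line' | extract_params_sh
# prints: make=BMW&year=2000
```

No external process is started per line, which is the main appeal of doing this in the shell itself.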
ghostdog74
A: 
$ echo "input:a_log.gz:make=BMW&year=2000&owner=Peter" | sed -e 's/input:.*\.gz://' -e 's/&owner.*//'
make=BMW&year=2000
Vijay Sarathi
A: 

I didn't see an answer using awk:

awk '{ match($0, /input:.*\.gz:/);
       m = RSTART+RLENGTH;
       n = index($0, "&owner=Peter") - m;
       print substr($0,m,n)
     }'

The method is sort of a mix between the sh version (substring extraction via parameter expansion) and the sed versions (regular expressions). This is because awk regular expressions lack backreferences.
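If the file may also contain lines that do not follow the pattern, a guarded variant might look like the sketch below. It uses the same match/index/substr approach, anchors the prefix at the start of the line, and skips lines where either marker is missing. The wrapper name extract_params_awk is made up for illustration.

```shell
# Hypothetical wrapper: reads lines on stdin and prints the extracted part.
extract_params_awk() {
  awk '
    # The action runs only when the prefix matches (match returns non-zero).
    match($0, /^input:.*\.gz:/) {
      start = RSTART + RLENGTH             # first character after ".gz:"
      end = index($0, "&owner=Peter")      # 0 when the suffix is absent
      if (end >= start) print substr($0, start, end - start)
    }'
}

# Usage with the sample line from the question plus one non-matching line.
printf '%s\n' \
  'input:a_log.gz:make=BMW&year=2000&owner=Peter' \
  'some unrelated line' | extract_params_awk
# prints: make=BMW&year=2000
```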

schot
That is only if you are using nawk or old awk. gawk has backreferences (in gensub(), and its match() can fill an array with the captured groups).
ghostdog74
@ghostdog74 I'm using `awk` as described in my copy of "The AWK Programming Language". `gawk` != `awk` (although perhaps `gawk` > `awk`). One of the advantages of using `awk` for me is default availability, which disappears when you write `gawk`-specific code.
schot