ansaurus

Question

Answer 1

A:

I don't believe this is possible with grep alone.

With sed:

name=`echo $f | sed -E 's/([0-9]+_([a-z]+)_[0-9a-z]*)|.*/\2/'`

I'll take a stab at the bonus though:

echo "$name.jpg"

cobbal 2009-12-12 01:00:33

Ah, of course, thanks for that haha.

Isaac Hodes 2009-12-12 01:05:09

Unfortunately, that `sed` solution doesn't work. It simply prints out everything in my directory.

Isaac Hodes 2009-12-12 01:14:18

Updated. It will now output a blank line if there isn't a match, so be sure to check for that.

cobbal 2009-12-12 01:19:17

It now outputs only blank lines!

Isaac Hodes 2009-12-12 01:24:54

This sed has a problem. The first group of capturing parentheses encompasses everything, so of course \2 will have nothing.

2009-12-12 04:36:48

It worked for some simple test cases... \2 gets the inner group.

cobbal 2009-12-12 06:01:12
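
For comparison, here is a minimal sketch of a sed variant that avoids the alternation discussed in the comments above. It is not the answerer's command: it assumes a sed that accepts -E (GNU and BSD sed both do), uses -n with the p flag so that non-matching names print nothing, and the function name extract_name is made up for this illustration.

```shell
#!/bin/sh
# extract_name is a hypothetical helper used only in this sketch.
# -n suppresses default output; the trailing p prints only when the
# substitution succeeded, so non-matching input yields no output at all.
extract_name() {
    printf '%s\n' "$1" | sed -nE 's/^[0-9]+_([a-z]+)_[0-9a-z]*.*$/\1/p'
}

name=$(extract_name 001_abc_0za.jpg)
echo "$name.jpg"
```

Because there is only one capture group and no alternation, \1 refers unambiguously to the letters between the two underscores.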

Answer 2

+1 A:

A suggestion for you - you can use parameter expansion to remove the part of the name from the last underscore onwards, and similarly at the start:

f=001_abc_0za.jpg
work=${f%_*}
name=${work#*_}

Then name will have the value abc.

See Apple developer docs, search forward for 'Parameter Expansion'.
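
The two expansions above can be wrapped in a small function, sketched here under a few assumptions: the helper name middle_field is invented for the example, and the case check is an optional extra (not part of the answer) that rejects results containing anything other than lowercase letters.

```shell
#!/bin/sh
# middle_field is a hypothetical helper name used only for this sketch.
middle_field() {
    f=$1
    work=${f%_*}      # drop the shortest suffix matching "_*": 001_abc_0za.jpg -> 001_abc
    name=${work#*_}   # drop the shortest prefix matching "*_": 001_abc -> abc
    # Optional check: fail if the result is empty or has non-lowercase characters
    case $name in
        ''|*[!a-z]*) return 1 ;;
    esac
    printf '%s\n' "$name"
}

middle_field 001_abc_0za.jpg
```

No external command is started, which matters when looping over many files. Note that how [a-z] is interpreted in a case pattern can depend on the locale.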

martin clayton 2009-12-12 01:16:46

Ah, now this does work. But is it *unix-y* enough? Hmm...

Isaac Hodes 2009-12-12 01:42:16

This will not check for ([a-z]+).

2009-12-12 04:09:26

@levislevis - that's true, but, as commented by the OP, it does do what was needed.

martin clayton 2009-12-12 05:18:36

Answer 3

+2 A:

This isn't really possible with pure grep, at least not generally.

But if your pattern is suitable, you may be able to use grep multiple times within a pipeline to first reduce your line to a known format, and then to extract just the bit you want. (Although tools like cut and sed are far better at this).

Suppose for the sake of argument that your pattern was a bit simpler: [0-9]+_([a-z]+)_. You could extract this like so:

echo "$name" | grep -oEi '[0-9]+_[a-z]+_' | grep -oEi '[a-z]+'

The first grep discards any lines that don't match your overall pattern and, because of --only-matching (-o), passes on only the matched part; the second grep (also with -o) then displays just the alphabetic portion of the name. This only works because the pattern is suitable: "alpha portion" is specific enough to pull out what you want.

(Aside: Personally I'd use grep + cut to achieve what you are after: echo $name | grep {pattern} | cut -d _ -f 2. This gets cut to parse the line into fields by splitting on the delimiter _, and returns just field 2 (field numbers start at 1)).
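
To keep the result rather than just print it, either pipeline can be wrapped in command substitution. The sketch below assumes a grep that supports -o (GNU and BSD grep do; it is not required by POSIX), and the variable names name_grep and name_cut are made up for the example. The first grep in the two-grep version uses -o so that only the matched prefix reaches the second grep.

```shell
#!/bin/sh
f=001_abc_0za.jpg

# Two greps: the first keeps only the "digits_letters_" prefix,
# the second keeps only the letters from that prefix.
name_grep=$(printf '%s\n' "$f" | grep -oE '[0-9]+_[a-z]+_' | grep -oE '[a-z]+')

# grep as a filter, cut as the extractor (field 2 when splitting on "_").
name_cut=$(printf '%s\n' "$f" | grep -E '^[0-9]+_[a-z]+_' | cut -d _ -f 2)

echo "$name_grep"
echo "$name_cut"
```

Both variables end up holding abc for this input. cut writes the selected field to standard output, which is what the substitution captures.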

Unix philosophy is to have tools which do one thing, and do it well, and combine them to achieve non-trivial tasks, so I'd argue that grep + sed etc is a more Unixy way of doing things :-)

RobM 2009-12-12 01:26:04

Very interesting. I hadn't even heard of `cut`! How might I store the output of that to a variable, though? Does `cut` return the string it's just operated on, unlike `grep`?

Isaac Hodes 2009-12-12 01:38:15

`for f in $files; do name=$(echo $f | grep -oEi '[0-9]+_([a-z]+)_[0-9a-z]*' | cut -d _ -f 2);` Aha!

Isaac Hodes 2009-12-12 01:43:48

Using the shell, there's no need for grep + cut. That's wasted overhead if the OP has lots of files.

2009-12-12 04:10:25

I disagree with that "philosophy". If you can use the shell's built-in capabilities without calling external commands, your script will run a lot faster. Some tools overlap in function, e.g. grep, sed, and awk. All of them do string manipulation, but awk stands out because it can do a lot more. In practice, chains of commands like the double grep above, or grep + sed, can be shortened to a single awk process.

ghostdog74 2009-12-12 04:43:46
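
As an illustration of the single-awk approach mentioned in the comment above, here is a minimal sketch. It is not taken from the thread: it assumes the simplified pattern from the answer, splits fields on the underscore with -F_, and uses a made-up variable name.

```shell
#!/bin/sh
f=001_abc_0za.jpg

# One awk process does both jobs: the regex filters the line,
# and -F_ splits it on underscores so $2 is the middle field.
name_awk=$(printf '%s\n' "$f" | awk -F_ '/^[0-9]+_[a-z]+_/ { print $2 }')

echo "$name_awk"
```

This still starts one external process per call, so for a large number of files it helps to feed awk the whole list at once rather than invoking it inside a loop.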

@ghostdog74: No argument here that chaining lots of tiny operations together is generally less efficient than doing it all in one place, but I stand by my assertion that the Unix philosophy is lots of tools working together. For instance, tar just archives files, it doesn't compress them, and because it outputs to STDOUT by default you can pipe it across the network with netcat, or compress it with bzip2, etc. Which to my mind reinforces the convention and general ethos that Unix tools should be able to work together in pipes.

RobM 2009-12-13 14:26:25

Answer 4

+4 A:

If you're using Bash, you don't even have to use grep:

files="*.jpg"
for f in $files
do
    # Skip any file whose name does not match the pattern
    [[ $f =~ [0-9]+_([a-z]+)_[0-9a-z]* ]] || continue
    name="${BASH_REMATCH[1]}"
    echo "${name}.jpg"    # concatenate strings
    name="${name}.jpg"    # same thing stored in a variable
done

This uses =~ which is Bash's regex match operator. The results of the match are saved to an array called $BASH_REMATCH. The first capture group is stored in index 1, the second (if any) in index 2, etc. Index zero is the full match.
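
The indexing described above can be seen in a small sketch. The regex here adds two extra capture groups purely for illustration, so it differs from the one in the answer, and the variable names are made up. It requires Bash, since [[ and BASH_REMATCH are not available in plain sh.

```shell
#!/usr/bin/env bash
f=001_abc_0za.jpg
# Three groups for illustration only; the answer's regex has a single group.
re='^([0-9]+)_([a-z]+)_([0-9a-z]*)'

if [[ $f =~ $re ]]; then
    full=${BASH_REMATCH[0]}     # the whole matched text
    digits=${BASH_REMATCH[1]}   # first capture group
    letters=${BASH_REMATCH[2]}  # second capture group
    tail=${BASH_REMATCH[3]}     # third capture group
    echo "full=$full digits=$digits letters=$letters tail=$tail"
fi
```

Storing the regex in a variable and expanding it unquoted on the right of =~ avoids quoting surprises, because a quoted pattern is matched as a literal string.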

You should be aware that without anchors, this regex (and the one using grep) will match any of the following examples and more, which may not be what you're looking for:

123_abc_d4e5
xyz123_abc_d4e5
123_abc_d4e5.xyz
xyz123_abc_d4e5.xyz

To eliminate the second and fourth examples, make your regex like this:

^[0-9]+_([a-z]+)_[0-9a-z]*

which says the string must start with one or more digits. The caret represents the beginning of the string. If you add a dollar sign at the end of the regex, like this:

^[0-9]+_([a-z]+)_[0-9a-z]*$

then the third example will also be eliminated since the dot is not among the characters in the regex and the dollar sign represents the end of the string. Note that the fourth example fails this match as well.
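
The effect of the anchors can be checked with a short sketch that runs the four example strings against the three regexes. The helper name matches and the variable names are invented for this illustration; Bash is required.

```shell
#!/usr/bin/env bash
examples=(123_abc_d4e5 xyz123_abc_d4e5 123_abc_d4e5.xyz xyz123_abc_d4e5.xyz)

unanchored='[0-9]+_([a-z]+)_[0-9a-z]*'
start_only='^[0-9]+_([a-z]+)_[0-9a-z]*'
both_ends='^[0-9]+_([a-z]+)_[0-9a-z]*$'

# matches is a hypothetical helper: print the examples that match regex $1
matches() {
    local re=$1 s out=()
    for s in "${examples[@]}"; do
        [[ $s =~ $re ]] && out+=("$s")
    done
    echo "${out[*]}"
}

echo "unanchored: $(matches "$unanchored")"
echo "start only: $(matches "$start_only")"
echo "both ends:  $(matches "$both_ends")"
```

The unanchored regex accepts all four strings, the caret alone leaves the first and third, and adding the dollar sign leaves only 123_abc_d4e5.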

Dennis Williamson 2009-12-12 02:59:03

Thanks Dennis! I appreciate the detailed help. I had completely forgotten about the `=~` operator (I'm very new to Bash scripting, so I've seen it maybe once or twice). I've **never** seen `${BASH_REMATCH[n]}`! That would have saved me ages. Thanks so much! (Aside: the regex I made doesn't handle cases like the ones described very well, but it handled the large number of .jpgs I wanted to rename. I appreciate the extra RegEx explanations, too, though.) Cheers!

Isaac Hodes 2009-12-12 03:58:59

Answer 5

+1 A:

If you have Bash, you can use extended globbing:

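shopt -s extglob      # enable extended patterns such as +(...)
shopt -s nullglob     # a pattern with no matches expands to nothing
shopt -s nocaseglob   # match filenames case-insensitively
for file in +([0-9])_+([a-z])_+([a-z0-9]).jpg
do
   IFS="_"            # split on underscores (IFS stays changed after the loop)
   set -- $file       # the fields become $1, $2, $3
   echo "This is your captured output : $2"
done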

or

ls +([0-9])_+([a-z])_+([a-z0-9]).jpg | while read file
do
   IFS="_"
   set -- $file
   echo "This is your captured output : $2"
done
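
As a rough explanation of the pattern used above: with extglob enabled, +(list) matches one or more occurrences of the given patterns, so +([0-9]) means "one or more digits". The sketch below applies the same pattern to a string with [[ ... == ... ]] instead of globbing the directory, and limits the IFS change to a single read. The function name capture_middle is hypothetical, and Bash is required.

```shell
#!/usr/bin/env bash
shopt -s extglob   # enable +(...) and the other extended patterns

# capture_middle is a hypothetical helper used only in this sketch.
capture_middle() {
    local file=$1 first middle rest
    # Reject names that do not look like digits_letters_alnum.jpg
    [[ $file == +([0-9])_+([a-z])_+([a-z0-9]).jpg ]] || return 1
    # IFS=_ applies to this read only, so the shell's IFS is left untouched
    IFS=_ read -r first middle rest <<< "$file"
    printf '%s\n' "$middle"
}

capture_middle 001_abc_0za.jpg
```

Scoping IFS to the read command avoids the side effect of assigning IFS inside the loop, where the new value would persist for the rest of the script.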

2009-12-12 04:06:06

That looks intriguing. Could you perhaps append a little explanation to it? Or, if you're so inclined, link to a particularly insightful resource that explains it? Thanks!

Isaac Hodes 2009-12-12 04:14:44

bash reference manual - 3.5.8.1 Pattern Matching

2009-12-12 04:27:47

forgot the link: here it is http://www.gnu.org/software/bash/manual/bashref.html

2009-12-12 04:31:00

Capturing Groups From a Grep RegEx