First, I don't know if this is actually possible but what I want to do is repeat a regex pattern. The pattern I'm using is:

sed 's/[^-\t]*\t[^-\t]*\t\([^-\t]*\).*/\1/' films.txt

An input of

250.    7.9    Shutter Island (2010)    110,675

Will return:

Shutter Island (2010)

I'm matching all non-tabs (250.), then a tab, then all non-tabs (7.9), then a tab. Next I backreference the film title, then match all remaining characters (110,675).

It works fine, but I'm learning regex and this looks ugly: the regex [^-\t]*\t is repeated just after itself. Is there any way to repeat this, like you can with a single character, e.g. a{2,2}?

I've tried ([^-\t]*\t){2,2} (and variations) but I'm guessing that is trying to match [^-\t]*\t\t?

Also, if there is any way to make my above code shorter and cleaner, any help would be greatly appreciated.

A: 

You can repeat things by putting them in parentheses, like this:

([^-\t]*\t){2,2}

And the full pattern to match the title would be this:

([^-\t]*\t){2,2}([^-\t]+).*

You said you tried it. I'm not sure what is different, but the above worked for me on your sample data.
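As a rough sketch of how the pattern above can be run through sed: the unescaped parentheses and braces are only treated as grouping and repetition in extended syntax, so this assumes a sed that accepts `-E` (recent GNU and BSD versions do). The variable names `line`, `tab`, and `ere_title` are only for illustration, and the tab is passed in through a shell variable because not every sed understands `\t` inside brackets.

```shell
# Build a sample line with real tab characters so the separators are unambiguous.
line=$(printf '250.\t7.9\tShutter Island (2010)\t110,675')
tab=$(printf '\t')

# -E switches sed to extended syntax, where ( ) group and {2} repeats the group.
# Group 1 matches "non-tabs then a tab" twice; group 2 captures the third field.
ere_title=$(printf '%s\n' "$line" | sed -E "s/([^-$tab]*$tab){2}([^-$tab]+).*/\2/")

printf '%s\n' "$ere_title"
```

Without `-E`, the same parentheses and braces would need backslashes to get their special meaning.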

Sam
I was trying things out myself, and just used what you wrote here and it's not working for me either. I tried with plain parens as you typed it `( ... )` (not expecting it to work) and escaped parens `\\( ... \\)`, also escaping the `\+` ... my `sed --version` says `GNU sed version 4.1.5` and it's on RedHat Enterprise 5.1 [oh look, the backslash didn't show in the comment until I doubled it `\\\(`]
Stephen P
+1  A: 

I think you might be going about this the wrong way. If you simply want to extract the name of the film and its release year, then you could try this regex:

(?:\t)[\w ()]+(?:\t)

As seen in place here:

http://regexr.com?2sd3a

Note that it matches a tab character at the beginning and end of the actual desired string. The (?:...) groups are non-capturing, so the tabs are not stored in a group, although they are still part of the overall match.

andy matthews
It might also help to explain what you want the result to be.
andy matthews
Cheers this works perfectly, thanks for the link, will help with debugging/learning better regex.
akd5446
I like how concise this is and see in your link how it *matches*, but how is it used, and with what command, to *extract* the name/date from the line? I don't see using it with `sed` since it doesn't have a capturing group and replacement. I'll upvote if you add an example of using it in a command to actually produce output that lists the name(s) from a file.
Stephen P
I'd love to give you a real life example, but I don't know sed. I'm sure this could be rewritten without using non-capturing groups but I'll have to do some research on it.
andy matthews
That's a Perl regular expression, which `sed` doesn't understand.
Dennis Williamson
A: 

Why are you doing things the hard way?

$ awk '{$1=$2=$NF=""}1' file
  Shutter Island (2010)
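The command above splits on any whitespace and blanks the first, second, and last fields, which leaves the separator spaces around the title. A possible variant, assuming the file really is tab-separated as the question's regex suggests, is to make the tab the field separator and print the third field directly. The names `line` and `awk_title` are illustrative only.

```shell
# Sample line with real tabs between the four fields.
line=$(printf '250.\t7.9\tShutter Island (2010)\t110,675')

# -F'\t' makes tab the only field separator, so spaces inside the title are kept
# and $3 is the whole title.
awk_title=$(printf '%s\n' "$line" | awk -F'\t' '{ print $3 }')

printf '%s\n' "$awk_title"
```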
ghostdog74
Thanks, this also works. Will go and learn some more Linux.
akd5446
+1  A: 

If this is a tab-separated file with a regular format, I'd use cut instead of sed:

cut -d' ' -f3 films.txt

Note there's a single tab between the quotes after the -d, which can be typed at the shell prompt by typing ctrl+v first, i.e. ctrl+v ctrl+i. Tab is also cut's default delimiter, so the -d option can be left out entirely.
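A small sketch of the cut approach that avoids typing a literal tab: the sample line is built with printf, and the tab is held in a shell variable. The names `line`, `tab`, `cut_default`, and `cut_explicit` are made up for this example.

```shell
# Sample line with real tabs between the four fields.
line=$(printf '250.\t7.9\tShutter Island (2010)\t110,675')
tab=$(printf '\t')

# Relying on cut's default delimiter (tab).
cut_default=$(printf '%s\n' "$line" | cut -f3)

# Passing the tab explicitly through a variable instead of ctrl+v ctrl+i.
cut_explicit=$(printf '%s\n' "$line" | cut -d "$tab" -f3)

printf '%s\n%s\n' "$cut_default" "$cut_explicit"
```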

Stephen P
There are spaces within the movie names.
ghostdog74
Thank you, this also works. Would up-vote but can't yet.
akd5446
@ghostdog : according to the OP's regex there are tabs, not spaces.
Stephen P
A: 

This works for me:

sed 's/\([^\t]*\t\)\{2\}\([^\t]*\).*/\2/' films.txt

If your sed supports -r you can get rid of most of the escaping:

sed -r 's/([^\t]*\t){2}([^\t]*).*/\2/' films.txt

Change the first 2 to select different fields (0-3).

This will also work:

sed 's/[^\t]\+/\n&/3;s/.*\n//;s/\t.*//' films.txt

Change the 3 to select different fields (1-4).
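To illustrate how changing the repeat count selects a different field, here is a minimal sketch that wraps the first command in a shell function. `nth_field` is a hypothetical helper, not part of sed, and the tab is supplied through a variable on the assumption that `\t` inside a bracket expression may not be understood by every sed.

```shell
tab=$(printf '\t')

# nth_field N (hypothetical helper): skip N tab-terminated fields, then print the next one.
# N plays the role of the "2" in \{2\}; for this four-field sample it can be 0 to 3.
nth_field() {
    sed "s/\([^$tab]*$tab\)\{$1\}\([^$tab]*\).*/\2/"
}

line=$(printf '250.\t7.9\tShutter Island (2010)\t110,675')

rank=$(printf '%s\n' "$line" | nth_field 0)
title=$(printf '%s\n' "$line" | nth_field 2)
votes=$(printf '%s\n' "$line" | nth_field 3)

printf '%s\n%s\n%s\n' "$rank" "$title" "$votes"
```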

Dennis Williamson