What single regex would let me capture all the text that comes after "are genes" or "is gene" in this text?

The closest human genes of best are genes A B C
The closest human gene of best is gene A 

So I want $1 to contain:

A B C
A 

I tried this, but it fails on the singular "is gene" line:

$line =~ /The closest .* gene[s] (.*)$/;
+4  A: 
$line =~ /The closest .* genes? (.*)$/;
SilentGhost
+1 for matching the asker's example as closely as possible, but this could benefit from some explanation: [s] is the same as a plain s, [s ] is probably what he was trying to write, and s? is the form that does the job.
kbenson
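
The difference between gene[s] and genes? can be checked directly. The following is a minimal sketch, not part of the original answer; `capture_with` is a hypothetical helper introduced only for illustration, and the two input lines are the samples from the question without the trailing space.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the captured text, or undef when there is no match.
sub capture_with {
    my ($regex, $line) = @_;
    return $line =~ $regex ? $1 : undef;
}

my $with_class    = qr/The closest .* gene[s] (.*)$/;  # [s] is a class holding only "s"
my $with_optional = qr/The closest .* genes? (.*)$/;   # s? makes the "s" optional

for my $line ('The closest human genes of best are genes A B C',
              'The closest human gene of best is gene A') {
    my $class    = capture_with($with_class, $line)    // 'no match';
    my $optional = capture_with($with_optional, $line) // 'no match';
    print "gene[s] => $class; genes? => $optional\n";
}
```

With these inputs, the gene[s] pattern should capture "A B C" on the plural line and report no match on the singular one, while genes? captures on both.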
+2  A: 
$ perl -F/genes*/ -ane 'print $F[-1];' file
 A B C
 A
ghostdog74
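
For use inside a script rather than on the command line, the same split idea might look like the sketch below. `after_last_gene` is a hypothetical name, the separator is written as genes? rather than genes*, and the whitespace trimming is an addition that the one-liner does not perform (its output keeps the leading space).

```perl
use strict;
use warnings;

# Hypothetical helper: split on "gene" or "genes" and keep what follows the last one.
sub after_last_gene {
    my ($line) = @_;
    my $tail = (split /genes?/, $line)[-1];
    $tail =~ s/^\s+|\s+$//g;    # trim the surrounding whitespace left by the split
    return $tail;
}

for my $line ('The closest human genes of best are genes A B C',
              'The closest human gene of best is gene A ') {
    my $tail = after_last_gene($line);
    print "$tail\n";
}
```

Note that a line containing no "gene" at all comes back whole, so this approach assumes every input line has the expected shape.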
A: 

In addition to the other suggestions, I would recommend having a look at perlre, the Perl documentation for regular expressions.

Space
+2  A: 

Use a non-greedy quantifier at the beginning to reduce the opportunities for surprises. Use non-capturing parens to group alternatives that you don't care about. Append ? to a letter to make it optional. Hence, try this:

$line =~ /The closest .*? (?:is|are) genes? (.*)$/;

To see where you were going wrong BTW, just compare the above with what you were originally trying.

Donal Fellows
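
The three techniques named in this answer can be seen side by side by spreading the same pattern out with the x flag. This is only an annotated sketch of the regex above, with `$gene_list` as a hypothetical variable name; under x, literal spaces have to be written as [ ].

```perl
use strict;
use warnings;

# The pattern from the answer, one piece per line so each can carry a note.
my $gene_list = qr{
    The [ ] closest [ ]
    .*?               # non-greedy: consume as little as possible
    [ ] (?:is|are)    # grouped alternatives, not captured
    [ ] genes?        # optional plural s
    [ ] (.*) $        # first capture group: the gene names
}x;

for my $line ('The closest human genes of best are genes A B C',
              'The closest human gene of best is gene A') {
    print "'$1'\n" if $line =~ $gene_list;
}
```

Because (?:is|are) does not capture, the gene names remain in $1 and no renumbering of groups is needed.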
It captures some cases that are bad grammar too (“The closest ... is genes ..”) but that's hardly important, yes? :-)
Donal Fellows
@Donal: if it's not important why bother with that non-capturing group at all?
SilentGhost
@SilentGhost: Without it, you'll capture from the first instance of the word "gene" to the end, e.g., “`of best are genes A B C`”.
Donal Fellows
That's only because you're using the non-greedy quantifier.
SilentGhost
There aren't really enough input data samples in the question to be able to work out what is wanted. I personally prefer to match more in the fixed portion of the pattern to reduce the number of landmines^Wsurprises in the matched text.
Donal Fellows
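
The point argued in these comments can be reproduced on the first sample line. The sketch below is not from either commenter; the labels and the `@patterns` list are illustrative only.

```perl
use strict;
use warnings;

my $line = 'The closest human genes of best are genes A B C';

# Each entry pairs a label with one variant of the pattern under discussion.
my @patterns = (
    [ 'greedy, no is/are'     => qr/The closest .* genes? (.*)$/ ],
    [ 'non-greedy, no is/are' => qr/The closest .*? genes? (.*)$/ ],
    [ 'non-greedy, is/are'    => qr/The closest .*? (?:is|are) genes? (.*)$/ ],
);

for my $entry (@patterns) {
    my ($label, $re) = @$entry;
    my ($captured) = $line =~ $re;    # list context returns the capture groups
    printf "%-22s => '%s'\n", $label, $captured // 'no match';
}
```

On this line the first and third variants should capture "A B C", while the non-greedy variant without the is/are group stops at the first "genes" and captures "of best are genes A B C". So both remarks hold: the early match comes from the non-greedy quantifier, and the extra fixed text is what prevents it.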
+3  A: 

I think the most explicit is:

$line =~ m/best \s (?:is \s gene|are \s genes) \s ([\p{IsUpper}](?: \s [\p{IsUpper}])*)/x;

Of course if you know that all sentences are going to be grammatical, then you can do the (?:are|is) thing. And if you know that you're only going to have genes A-N or something, you can forget the \p{IsUpper} and use [A-N].

Axeman
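
As a rough illustration of the stricter variant, the sketch below uses the [A-N] shortcut mentioned in this answer. The A-N range is purely an assumption carried over from the answer's own example, and `$strict` is a hypothetical variable name.

```perl
use strict;
use warnings;

# Assumption for illustration: gene names are single letters in the range A-N.
my $strict = qr/best \s (?:is \s gene|are \s genes) \s ([A-N](?: \s [A-N])*)/x;

for my $line ('The closest human genes of best are genes A B C',
              'The closest human gene of best is gene A',
              'The closest human gene of best is genes A') {   # ungrammatical input
    my ($genes) = $line =~ $strict;
    print defined $genes ? "'$genes'\n" : "no match\n";
}
```

Unlike the (?:is|are) genes? form, this version should reject the ungrammatical third line, because "is" is only allowed before "gene" and "are" only before "genes".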