views:

183

answers:

3

Let's say I have a line of text like this

Small   0.0..20.0   0.00    1.49    25.71   41.05   12.31   0.00    80.56

I want to capture the last six numbers and ignore the Small and the first two groups of numbers.

For this exercise, let's ignore the fact that it might be easier to just do some sort of string-split instead of a regular expression.

I have this regex, which works but is kind of horrible-looking:

^(Small).*?[0-9.]+.*?[0-9.]+.*?([0-9.]+).*?([0-9.]+).*?([0-9.]+).*?([0-9.]+).*?([0-9.]+).*?([0-9.]+)

Is there some way to compact that?

For example, is it possible to combine the check for the last 6 numbers into a single statement that still stores the results as 6 separate group matches?

+5  A: 

If you want to keep each match in a separate backreference, you have no choice but to "spell it out" - if you use repetition, you can either catch all six groups "as one" or only the last one, depending on where you put the capturing parentheses. So no, it's not possible to compact the regex and still keep all six individual matches.
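To make the two repetition outcomes concrete, here is a small sketch using Python's re module (the variable names are just for illustration, and other regex flavors may differ in detail). A capturing group inside the repetition keeps only its last iteration, while a capturing group around the repetition returns everything as one string:

```python
import re

line = "Small   0.0..20.0   0.00    1.49    25.71   41.05   12.31   0.00    80.56"

# Capturing group INSIDE the repetition: each iteration overwrites the
# previous one, so only the final number survives.
inside = re.match(r"^Small(?:\s+([0-9.]+)){8}", line)
last_only = inside.groups()

# Capturing group AROUND the repetition: the six numbers come back as a
# single string that still has to be split afterwards.
around = re.match(r"^Small\s+[0-9.]+\s+[0-9.]+\s+((?:[0-9.]+\s*){6})", line)
as_one = around.group(1)

print(last_only)       # a 1-tuple holding only the last number
print(as_one.split())  # six numbers, but only after an extra split
```

Either way you end up with one group, not six, which is why the explicit version is needed.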

A somewhat more efficient (though not beautiful) regex would be:

^Small\s+[0-9.]+\s+[0-9.]+\s+([0-9.]+)\s+([0-9.]+)\s+([0-9.]+)\s+([0-9.]+)\s+([0-9.]+)\s+([0-9.]+)

since it matches the spaces explicitly. Your regex will result in a lot of backtracking. My regex matches in 28 steps, yours in 106.
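As a quick sanity check, the sketch below runs both patterns through Python's re module and compares what they capture on the sample line. It does not reproduce the step counts, since re does not expose them; the names lazy and explicit are only labels for this example.

```python
import re

line = "Small   0.0..20.0   0.00    1.49    25.71   41.05   12.31   0.00    80.56"

# Pattern from the question: lazy wildcards between the numbers.
lazy = (r"^(Small).*?[0-9.]+.*?[0-9.]+"
        r".*?([0-9.]+).*?([0-9.]+).*?([0-9.]+)"
        r".*?([0-9.]+).*?([0-9.]+).*?([0-9.]+)")

# Pattern from this answer: whitespace matched explicitly.
explicit = (r"^Small\s+[0-9.]+\s+[0-9.]+"
            r"\s+([0-9.]+)\s+([0-9.]+)\s+([0-9.]+)"
            r"\s+([0-9.]+)\s+([0-9.]+)\s+([0-9.]+)")

# The question's pattern also captures "Small" as group 1, so skip it.
lazy_numbers = re.match(lazy, line).groups()[1:]
explicit_numbers = re.match(explicit, line).groups()

print(lazy_numbers == explicit_numbers)
print(explicit_numbers)
```

On well-formed input the two agree; the difference is in how much work the engine does and in how they behave on malformed lines.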

Just as an aside: in Python, you could simply split the line and keep the last six fields:

>>> pieces = "Small   0.0..20.0   0.00    1.49    25.71   41.05   12.31   0.00    80.56".split()[-6:]
>>> print(pieces)
['1.49', '25.71', '41.05', '12.31', '0.00', '80.56']
Tim Pietzcker
Also, the .*? in the original version can match digits as well, which can result in an invalid match. This one is better.
Wimmel
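A small sketch of that failure mode in Python, using a made-up line (short_line is hypothetical, not from the question) that has only seven numbers instead of eight. The lazy pattern still finds a match by splitting the last number across two groups, while the whitespace-based pattern rejects the line:

```python
import re

# Hypothetical malformed input: seven numbers where eight are expected.
short_line = "Small 1 2 3 4 5 6 77"

lazy = (r"^(Small).*?[0-9.]+.*?[0-9.]+"
        r".*?([0-9.]+).*?([0-9.]+).*?([0-9.]+)"
        r".*?([0-9.]+).*?([0-9.]+).*?([0-9.]+)")
explicit = (r"^Small\s+[0-9.]+\s+[0-9.]+"
            r"\s+([0-9.]+)\s+([0-9.]+)\s+([0-9.]+)"
            r"\s+([0-9.]+)\s+([0-9.]+)\s+([0-9.]+)")

# Backtracking lets the lazy pattern carve "77" into two separate matches.
lazy_match = re.match(lazy, short_line)
# The explicit pattern needs whitespace between numbers, so it fails cleanly.
explicit_match = re.match(explicit, short_line)

print(lazy_match.groups()[1:])
print(explicit_match)
```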
Using \s instead of .*? is definitely a good idea. I just hate repeating ([0-9.]+) over and over but it might be unavoidable.
Mark Biek
A: 

For readability, you could use string interpolation to build the regex from its component parts.

$d = "[0-9.]+"; 
$s = ".*?"; 

$re = "^(Small)$s$d$s$d$s($d)$s($d)$s($d)$s($d)$s($d)$s($d)";

At least then you can see the structure past the pattern, and changing one part changes them all.
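The same idea can be sketched in Python, where string repetition also removes the need to type the capture group six times in the source (the compiled pattern still contains six explicit groups). The names NUM and SEP are made up for this example, and it uses \s+ as the separator rather than .*? as suggested in the other answers:

```python
import re

NUM = r"[0-9.]+"  # one number token
SEP = r"\s+"      # separator between tokens

# Two uncaptured numbers, then six captured ones.
pattern = r"^Small" + (SEP + NUM) * 2 + (SEP + "(" + NUM + ")") * 6

line = "Small   0.0..20.0   0.00    1.49    25.71   41.05   12.31   0.00    80.56"
numbers = re.match(pattern, line).groups()

print(pattern)
print(numbers)
```

This does not make the regex itself any shorter; it only moves the repetition into the code that builds it.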

If you wanted to go further, you could make up a short metasyntax and make it even easier to read:

$re = "^(Small)_#D_#D_(#D)_(#D)_(#D)_(#D)_(#D)_(#D)"; 
$re = str_replace('#D','[0-9.]+',$re); 
$re = str_replace('_', '.*?' , $re );

(This way it also becomes trivial to change the definition of what a space token is, or what a digit token is.)

Kent Fredric
+3  A: 

Here is the shortest I could get:

^Small\s+(?:[\d.]+\s+){2}([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s*$

It must be long because each capture must be specified explicitly. No need to capture "Small", though. But it is better to be specific (\s instead of .) when you can, and to anchor on both ends.
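To illustrate why anchoring on both ends matters, here is a sketch in Python with a made-up line that carries one extra column (long_line is hypothetical). With the trailing $ the pattern rejects it; without the anchor, the same pattern would quietly match the first eight numbers:

```python
import re

anchored = (r"^Small\s+(?:[\d.]+\s+){2}"
            r"([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s*$")
# Same pattern with the end anchor removed, for comparison only.
unanchored = anchored[:-len(r"\s*$")]

line = "Small   0.0..20.0   0.00    1.49    25.71   41.05   12.31   0.00    80.56"
# Hypothetical input with an unexpected ninth number appended.
long_line = line + "   99.9"

good = re.match(anchored, line)
rejected = re.match(anchored, long_line)
accepted_anyway = re.match(unanchored, long_line)

print(good.groups())
print(rejected)
print(accepted_anyway.groups())
```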

PhiLho
I think that answers my question if each capture has to be specified explicitly.
Mark Biek