So, folks, I have this self-crafted pattern that works. After some hours (I am no regex guru) this puppy evolved to parse curl PUT output for me:

   ^\s*([^ ]+)\s+([^ ]+)\s+([^ ]+)\s+([^ ]+)\s+([^ ]+)\s+([^ ]+)
    \s+([^ ]+)\s+([^ ]+)\s+([^ ]+)\s+([^ ]+)\s+([^ ]+)\s+([^ ]+)

(CR in text only for formatting)

It gives me 'groups' that I access--it works! Yet the coder in me sees the repetition of a pattern, and it bugs the frack out of me. I've seen Perl how-small-is-your-pattern contests over the years that make me think this could be much smaller. But my attempts to slap a * in it have failed miserably.

So, The Question Is: how do I write this pattern in a more concise way so that I can still pull out my target groups?

It probably doesn't matter, but here are the groups I am after:

$1: percent finished
$2: size uploaded so far
$6: size to upload
$8: average upload rate

Update: Further background can be found in a blog post of mine (How to configure OnMyCommand to generate a progress bar for curl) that explains what I am doing and why I am after only a regex pattern. I'm not actually coding in a language, per se...but configuring a tool to use a regex.

A: 
((^\s*|\s+)([^ ]+)){12}

If you do not care about the number of matches and want to match a complete string, just stick with the following.

((^\s*|\s+)([^ ]+))*\s*$
Daniel Brückner
At least in Perl that doesn't work; you only get the last of the 12 repeated matches in the capture.
Chas. Owens
I am using .NET/C#, so I cannot speak for other regex implementations.
Daniel Brückner
Doesn't look like it works in JavaScript either. I don't know what regex implementation he's working with, but I doubt it'll capture the groups he needs with a {12}.
ojrac
.NET lets you break out individual captures from repeated groups, but AFAIK it's unique in that regard. Also, the only way to retrieve individual captures is via API calls like M.Groups(2).Captures(6). (Maybe someday there will be a shorthand notation like "$2#6", but I doubt it.) This is not a solution Stu can use.
Alan Moore
Nope, did not work for me in OMC.
Stu Thompson
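
The behaviour described in these comments can be checked with a short Python sketch. The sample line is invented for illustration (12 space-separated fields shaped like a curl progress row), and Python's re module only stands in for the engine OMC uses, which the thread does not identify. Python, like Perl and JavaScript, keeps only the last repetition of a repeated capturing group:

```python
import re

# Hypothetical 12-field line, invented for this example.
line = " 45 2048k    0     0   45  921k      0  120k  0:00:17  0:00:07  0:00:10  118k"

repeated = re.compile(r"((^\s*|\s+)([^ ]+)){12}")
match = repeated.search(line)

# The pattern defines only three groups, however many times it repeats.
group_count = repeated.groups
# Each group holds whatever it matched on the 12th (last) repetition.
last_field = match.group(3)
print(group_count, repr(last_field))
```

So the pattern matches the whole line, but fields 1, 2, 6, and 8 cannot be addressed as separate groups in such an engine.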
+2  A: 

It looks like this is the best I can do:

^\s*([^ ]+)\s+([^ ]+)\s+(?:[^ ]+\s+){3}([^ ]+)\s+[^ ]+\s+([^ ]+)\s+

I collapsed the matches you do not care about, made them not capture, and left off the unneeded trailing matches. If it is important to match everything (e.g. there are other lines that would match this) you can say:

^\s*([^ ]+)\s+([^ ]+)\s+(?:[^ ]+\s+){3}([^ ]+)\s+[^ ]+\s+([^ ]+)(?:\s+[^ ]+){4}

Note, my changes also change the capture numbers:

  • $1: percent finished
  • $2: size uploaded so far
  • $3: size to upload
  • $4: average upload rate
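
As a rough check of the renumbering, here is a minimal Python sketch that runs the first pattern above against a made-up line. The line and its values are hypothetical (12 space-separated fields, not real curl output), and Python's re module is only a stand-in for the tool's regex engine:

```python
import re

# Hypothetical 12-field line, invented for this example.
line = " 45 2048k    0     0   45  921k      0  120k  0:00:17  0:00:07  0:00:10  118k"

# Fields 3-5 and 7 are matched by non-capturing pieces, so only
# fields 1, 2, 6, and 8 end up in groups 1 through 4.
pattern = re.compile(
    r"^\s*([^ ]+)\s+([^ ]+)\s+(?:[^ ]+\s+){3}([^ ]+)\s+[^ ]+\s+([^ ]+)\s+"
)

match = pattern.search(line)
fields = match.groups()
print(fields)
```

With this sample, the four groups hold the 1st, 2nd, 6th, and 8th fields of the line, which are the values the original pattern exposed as $1, $2, $6, and $8.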

You may be able to get away with this if your tool supports \S:

^\s*(\S+)\s+(\S+)\s+(?:\S+\s+){3}(\S+)\s+\S+\s+(\S+)\s+

but it does not mean exactly the same thing.
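
To illustrate the difference, a small Python sketch (Python's re module is assumed here purely for demonstration): [^ ] excludes only the space character, so it still matches tabs and newlines, while \S excludes every whitespace character.

```python
import re

# A tab is whitespace, but it is not a space character.
text = "a\tb"

# [^ ]+ accepts the tab because it only rejects literal spaces.
not_space = re.fullmatch(r"[^ ]+", text)
# \S+ rejects the tab, so the full match fails.
non_whitespace = re.fullmatch(r"\S+", text)

print(not_space is not None, non_whitespace is not None)
```

For input that is separated only by spaces, the two forms should behave the same.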

Chas. Owens
Thanks for your answer and comments elsewhere. It's bedtime for Bonzo but will test out your options, and others, in the morning. Thanks!
Stu Thompson
Fantastic, thanks. The first worked (after jiggling the capture numbers from my original), giving an expression that is only 55% the length of the original. I didn't try the second as it was longer. The third did not work.
Stu Thompson
A: 

If your regex uses greedy matching this might work:

^(\s*([^ ]+))+$

explanation:

  • ^ = start of line
  • repeated pattern = \s*([^ ]+)
  • surround that with parens and add '+' to indicate 'one or more matches of the preceding'
  • $ = end of line
Jay
It looks like he is trying to extract four distinct values from the output; this produces two captures, one of which is a submatch of the other.
Chas. Owens