ansaurus

Question

Answer 1

+1 A:

Supposing you know how to handle the VB.NET code to get the groupings (matches), and that you are willing to strip the extra spaces from the groupings yourself:

The Regex for case 1 is

(.*?\s+)(\d+.*)

    .*? => grabs everything non-greedily, so it stops at the first whitespace that is followed by a digit
    \s+ => one or more whitespace characters

    These two form the first group.

    \d+ => one or more digits
    .* => rest of the line

    These two form the second group.
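
A minimal sketch of how this pattern splits a line, in case it helps to see the groups. It is written in Python, whose `re` module stands in here for .NET's `Regex`; the pattern itself is unchanged. The helper name `split_case1` and the sample line are made up for illustration, since the original data is not shown.

```python
import re

# Case 1 pattern from above: lazy prefix + whitespace, then digits + rest.
CASE1 = re.compile(r"(.*?\s+)(\d+.*)")

def split_case1(line):
    """Return (first, second) with surrounding spaces stripped, or None."""
    m = CASE1.match(line)
    if m is None:
        return None
    return m.group(1).strip(), m.group(2).strip()

# Hypothetical sample line, not taken from the question.
print(split_case1("ASPIRIN 325MG 100"))  # ('ASPIRIN', '325MG 100')
```

Because `.*?` is followed by `\s+` and `\d+`, the first group ends at the first run of whitespace that has a digit right after it, and a line with no digits does not match at all.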

The Regex for case 2 is

(.{11})(.*?)(\d.*)

    .{11} => matches exactly 11 characters (you could restrict it to letters
             and digits by using [a-zA-Z\d] instead of .)

    That's the first group.

    .*? => Match everything non greedily, stop before the first 
           digit found (because that's the next regex)

    That's the second group.

    \d.* => a digit (used to stop the previous .*?) and the rest of the line

    That's the third group.
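
A similar sketch for the second pattern, again in Python as a stand-in for .NET's `Regex`. The helper name `split_case2` and the sample line are assumptions made for illustration; only the pattern comes from the answer.

```python
import re

# Case 2 pattern from above: 11 characters, lazy middle, first digit + rest.
CASE2 = re.compile(r"(.{11})(.*?)(\d.*)")

def split_case2(line):
    """Return the three groups with surrounding spaces stripped, or None."""
    m = CASE2.match(line)
    if m is None:
        return None
    return tuple(g.strip() for g in m.groups())

# Hypothetical sample line: 11-character code, a name, then a strength.
print(split_case2("00006011731 ASPIRIN TAB 325MG 30"))
# ('00006011731', 'ASPIRIN TAB', '325MG 30')
```

The second group ends at the first digit after the 11-character prefix, so a name field that itself contains a digit is cut short at that point.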

Vinko Vrsalovic 2009-06-18 21:21:35

Answer 2

+2 A:

I would do it with these expressions:

(?-s)(\S+) +(.+)

and

(?-s)(.{11})(\D+)(.+)

And broken down in regex comment mode, those are:

(?x-s)    # Flags: x enables comment mode, -s disables dotall mode.
(       # start first capturing group
 \S+     # any non-whitespace character, greedily matched at least once.
)       # end first capturing group
[ ]+     # a space character, greedily matched at least once. (brackets required in comment mode)
(       # start second capturing group
 .+      # any character (excluding newlines), greedily matched at least once.
)       # end second capturing group

and

(?x-s)    # Flags: x enables comment mode, -s disables dotall mode.
(       # start first capturing group
 .{11}   # any character (excluding newlines), exactly 11 times.
)       # end first capturing group
(       # start second capturing group
 \D+     # any non-digit character, greedily matched at least once.
)       # end second capturing group
(       # start third capturing group
 .+      # any character (excluding newlines), greedily matched at least once.
)       # end third capturing group

(The 'dotall' mode (flag s) means that . matches all characters, including newlines, so we have to disable it to prevent too much matching in the last group.)
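
To make the dotall point concrete, here is a small Python sketch of the first pattern in comment (verbose) mode. It toggles dotall with the `re.DOTALL` flag rather than the inline `(?-s)` prefix, and the two-line input is a made-up example; the same effect should be expected from the equivalent .NET option, but that is an assumption not tested here.

```python
import re

# First pattern in verbose mode; dotall is controlled by the flags argument.
PATTERN = r"""
(\S+)   # first group: run of non-whitespace characters
[ ]+    # one or more literal spaces (brackets needed in verbose mode)
(.+)    # second group: rest of the match
"""

# Hypothetical two-line input.
text = "ABC 123 X\nDEF 456 Y"

plain = re.match(PATTERN, text, re.VERBOSE)
dotall = re.match(PATTERN, text, re.VERBOSE | re.DOTALL)

print(plain.groups())   # ('ABC', '123 X')
print(dotall.groups())  # ('ABC', '123 X\nDEF 456 Y')
```

With dotall off, the last group stops at the end of the first line; with it on, the same group runs through the newline and swallows the following line as well.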

Peter Boughton 2009-06-18 21:31:51

Answer 3

A:

I would use Peter Boughton's regexes, but ensure you have . matches newline turned off. If that is on, ensure you add a $ on the end :)

The greedy regexes will perform better.

Billy ONeal 2009-06-18 21:34:04

The dotall mode (. matches newline) is usually off unless explicitly enabled. Is VB regex different in this respect? *shrug* I'll update my answer to explicitly disable it, to be on the safe side.

Peter Boughton 2009-06-18 21:45:43

Also, since there is no overlap in the characters, why would the non-greedy quantifiers perform better here? I would even be tempted to go for possessive quantifiers (to fail faster if invalid data found).

Peter Boughton 2009-06-18 21:49:43

http://www.regular-expressions.info/repeat.html <-- Scroll to "An Alternative to Laziness"

Billy ONeal 2009-06-18 22:23:58

Usually it's off by default... but if the asker is using someone else's code and pasting it in... ;)

Billy ONeal 2009-06-18 22:33:51

Exactly. Using negated classes that do not overlap means greedy is the better/simpler option than lazy. That's what I'm doing - \D and \S are both negated classes (shorthand for [^\d] and [^\s]) - did you mis-write the "non-greedy" part of your answer, or am I misunderstanding what you're saying?

Peter Boughton 2009-06-18 22:34:44

I said your regex is faster because it uses a non greedy expression, compared to Vinko Vrsalovic's which uses greedy expressions.

Billy ONeal 2009-06-19 04:46:23

I just realized I got flip flopped there hehe... thank you. Edited.

Billy ONeal 2009-06-19 04:47:02

OK... so you guys know a lot more about this than I do. Some more information on the problem: in Case 1, the text I really need is the second part of the string. If the string contains 4 digits followed by a space and then 3 letters at the end of the string, then I need to strip that part out. I was trying to use the Regex.Split function, but it kept returning an array of 3-4 matches, depending on which regex from this thread I used. I was hoping to have the split function return only the exact string that matches the pattern.

Kalel 2009-06-19 14:34:53

Use Regex.Match instead of Regex.Split. Then use Peter's regex. Use capturing groups instead of split.
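
A short Python sketch of the difference, with `re.split` and `re.match` standing in for `Regex.Split` and `Regex.Match`. In both Python and .NET, splitting on a pattern that contains capturing groups puts the captured text into the result array, which is where the extra entries come from. The sample line is made up.

```python
import re

PATTERN = r"(\S+) +(.+)"
line = "ABC 123 X"  # hypothetical sample line

# Splitting on a pattern with capturing groups returns the captures
# plus the (here empty) text before and after the match.
print(re.split(PATTERN, line))   # ['', 'ABC', '123 X', '']

# Matching and reading the groups gives just the two fields.
m = re.match(PATTERN, line)
if m:
    first, second = m.groups()
    print(first, "|", second)    # ABC | 123 X
```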

Billy ONeal 2009-06-19 20:53:39

For more info on that, see here -> http://www.regular-expressions.info/brackets.html

Billy ONeal 2009-06-19 20:54:35

Answer 4

A:

The simplest way for the kind of data you are presenting is to split the line into fields at the spaces, then rejoin the fields you want to keep together. Regex.Split(line, "\s+") returns an array of strings (VB.NET string literals do not need the backslash doubled; in C# you would write "\\s+" or @"\s+"). This is also more robust against changes in the field contents, for example if in the second case a line reads "00006011731 TAB 3FC 10MG 30UOU".
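
A rough Python sketch of the split-and-rejoin idea, using the sample line above. Which fields belong together is an assumption here (first field as the code, everything else rejoined), and the helper name `split_fields` is made up.

```python
import re

def split_fields(line):
    """Split on whitespace runs, then rejoin everything after the first field."""
    fields = re.split(r"\s+", line.strip())
    code, rest = fields[0], " ".join(fields[1:])
    return fields, code, rest

fields, code, rest = split_fields("00006011731 TAB 3FC 10MG 30UOU")
print(fields)  # ['00006011731', 'TAB', '3FC', '10MG', '30UOU']
print(code)    # 00006011731
print(rest)    # TAB 3FC 10MG 30UOU
```

Since the split only looks at whitespace, a digit inside a field such as "3FC" does not affect where the fields are cut.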

Svante 2009-06-18 21:34:06


Regular Expression help
