I have written code to pull some data into a data table and do some data re-formatting. I need some help splitting some text into appropriate columns.

CASE 1: I have data formatted like this that I need to split into 2 columns.

ABCDEFGS     0298 MSD
SDFKLJSDDSFWW         0298 RFD

I need the text before the numbers in column 1, and the numbers plus the text after them in column 2. The number of spaces between the text and the numbers will vary.

CASE 2: I have data like this that I need to split into 3 columns.

00006011731 TAB FC 10MG 30UOU
00006011754  TAB FC 10MG 90UOU
00006027531  TAB CHEW 5MG 30UOU
00006071131  TAB CHEW 4MG 30UOU
00006027554  TAB CHEW 5MG 90UO
00006384130  GRAN PKT 4MG 30UOU
  1. Column 1 is the first 11 characters. That is easy.
  2. Column 2 should contain all the text after the first 11 characters, up to but not including the first number.
  3. Column 3 is all the text after column 2.
+1  A: 

This assumes you know how to handle the VB.NET code to get the groupings (matches) and that you are willing to strip the extra spaces from the groupings yourself.

The Regex for case 1 is

(.*?\s+)(\d+.*)
    .*? => grabs everything non-greedily, so it stops at the first run of whitespace that is followed by a digit
    \s+ => one or more whitespace characters

    These two form the first group.

    \d+ => one or more digits
    .* => rest of the line

    These two form the second group.
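
As a rough illustration of how the two groups come out, here is a minimal sketch. It uses Python's `re` module rather than VB.NET, on the assumption that this pattern behaves the same way in both engines; the function name `split_case1` is made up for the example.

```python
import re

# Pattern from the answer above: lazy text plus whitespace, then digits and the rest.
CASE1 = re.compile(r"(.*?\s+)(\d+.*)")

def split_case1(line):
    """Return (text, code) for a case 1 line, or None if the line does not match."""
    m = CASE1.match(line)
    if m is None:
        return None
    # Group 1 still carries the trailing spaces, so they are stripped here.
    return m.group(1).strip(), m.group(2).strip()

print(split_case1("ABCDEFGS     0298 MSD"))          # ('ABCDEFGS', '0298 MSD')
print(split_case1("SDFKLJSDDSFWW         0298 RFD"))  # ('SDFKLJSDDSFWW', '0298 RFD')
```

The `strip()` calls are the "strip the extra spaces yourself" step mentioned at the top of this answer.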

The Regex for case 2 is

(.{11})(.*?)(\d.*)
    .{11} => matches 11 characters (you could restrict it to be just letters
             and numbers with [a-zA-Z] or \d instead of .)

    That's the first group.

    .*? => Match everything non greedily, stop before the first 
           digit found (because that's the next regex)

    That's the second group.

    \d.* => a digit (used to stop the previous .*?) and the rest of the line

    That's the third group.
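
A similar sketch for case 2, again in Python's `re` rather than VB.NET and under the same assumption that the pattern carries over unchanged. `split_case2` is a hypothetical helper name.

```python
import re

# Pattern from the answer above: 11 characters, lazy middle, then first digit onward.
CASE2 = re.compile(r"(.{11})(.*?)(\d.*)")

def split_case2(line):
    """Return the three columns of a case 2 line, or None if it does not match."""
    m = CASE2.match(line)
    if m is None:
        return None
    # The middle group keeps its surrounding spaces, so every group is stripped.
    return tuple(g.strip() for g in m.groups())

print(split_case2("00006011731 TAB FC 10MG 30UOU"))
# ('00006011731', 'TAB FC', '10MG 30UOU')
print(split_case2("00006384130  GRAN PKT 4MG 30UOU"))
# ('00006384130', 'GRAN PKT', '4MG 30UOU')
```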
Vinko Vrsalovic
+2  A: 

I would do it with these expressions:

(?-s)(\S+) +(.+)

and

(?-s)(.{11})(\D+)(.+)


And broken down in regex comment mode, those are:

(?x-s)    # Flags: x enables comment mode, -s disables dotall mode.
(       # start first capturing group
 \S+     # any non-space character, greedily matched at least once.
)       # end first capturing group
[ ]+     # a space character, greedily matched at least once. (brackets required in comment mode)
(       # start second capturing group
 .+      # any character (excluding newlines), greedily matched at least once.
)       # end second capturing group

and

(?x-s)    # Flags: x enables comment mode, -s disables dotall mode.
(       # start first capturing group
 .{11}   # any character (excluding newlines), exactly 11 times.
)       # end first capturing group
(       # start second capturing group
 \D+     # any non-digit character, greedily matched at least once.
)       # end second capturing group
(       # start third capturing group
 .+      # any character (excluding newlines), greedily matched at least once.
)       # end third capturing group


(The 'dotall' mode (flag s) means that . matches all characters, including newlines, so we have to disable it to prevent too much matching in the last group.)
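
For comparison, a minimal sketch of the two greedy patterns applied to the sample data. It is written in Python's `re` rather than VB.NET. The leading `(?-s)` is left out here because, as far as I know, Python does not accept that global form, and its `.` does not match newlines unless `re.DOTALL` is passed. The helper name `groups` is invented for the example.

```python
import re

# The two greedy patterns from this answer, without the (?-s) prefix.
CASE1_GREEDY = re.compile(r"(\S+) +(.+)")
CASE2_GREEDY = re.compile(r"(.{11})(\D+)(.+)")

def groups(pattern, line):
    """Return the stripped capture groups, or None if the line does not match."""
    m = pattern.match(line)
    if m is None:
        return None
    return tuple(g.strip() for g in m.groups())

print(groups(CASE1_GREEDY, "ABCDEFGS     0298 MSD"))
# ('ABCDEFGS', '0298 MSD')
print(groups(CASE2_GREEDY, "00006027531  TAB CHEW 5MG 30UOU"))
# ('00006027531', 'TAB CHEW', '5MG 30UOU')

# With dotall off, the last group stops at the end of the first line.
print(groups(CASE1_GREEDY, "A 1\nB 2"))
# ('A', '1')
```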

Peter Boughton
A: 

I would use Peter Boughton's regexes, but ensure you have . matches newline turned off. If that is on, ensure you add a $ on the end :)

The greedy regexes will perform better.

Billy ONeal
The dotall mode (. matches newline) is usually off unless explicitly enabled - is VB regex different in this respect? *shrug* I'll update the question to explicitly disable it, to be on the safe side.
Peter Boughton
Also, since there is no overlap in the characters, why would the non-greedy quantifiers perform better here? I would even be tempted to go for possessive quantifiers (to fail faster if invalid data found).
Peter Boughton
http://www.regular-expressions.info/repeat.html <-- Scroll to "An Alternative to Laziness"
Billy ONeal
Usually it's off by default... but if the asker is using someone else's code and pasting it in... ;)
Billy ONeal
Exactly. Using negated classes that do not overlap means greedy is the better/simpler option than lazy. That's what I'm doing - \D and \S are both negated classes (shorthand for [^\d] and [^\s]) - did you mis-write the "non-greedy" part of your answer, or am I misunderstanding what you're saying?
Peter Boughton
I said your regex is faster because it uses a non greedy expression, compared to Vinko Vrsalovic's which uses greedy expressions.
Billy ONeal
I just realized I got flip flopped there hehe... thank you. Edited.
Billy ONeal
OK... so you guys know a lot more about this than I do. Some more information on the problem: in Case 1, the text I really need is the second part of the string. If the string contains 4 numbers followed by a space and then 3 letters at the end of the string, then I need to strip it out. I was trying to use the Regex.Split function, but it kept returning an array of 3-4 matches, depending on which regex from this thread I used. I was hoping to have the Split function return only the exact string that matches the pattern.
Kalel
Use Regex.Match instead of Regex.Split. Then use Peter's regex. Use capturing groups instead of split.
Billy ONeal
For more info on that, see here -> http://www.regular-expressions.info/brackets.html
Billy ONeal
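
To make the match-versus-split advice concrete, here is a minimal sketch in Python's `re` (not VB.NET; in .NET the corresponding step would be `Regex.Match` plus reading the captured group). The pattern and the name `trailing_code` are assumptions for illustration: the pattern encodes "4 digits, a space, then 3 letters at the end of the string" as described in the comment above. A split reports the pieces around each match, plus any captured groups, which is probably why the array had 3-4 entries; a match with one capturing group returns only the wanted text.

```python
import re

# Hypothetical pattern: 4 digits, one space, 3 letters, anchored at the end of the string.
TRAILING_CODE = re.compile(r"\b(\d{4} [A-Za-z]{3})\s*$")

def trailing_code(line):
    """Return the trailing '0000 XXX' part of a line, or None if it is absent."""
    m = TRAILING_CODE.search(line)
    return m.group(1) if m else None

print(trailing_code("ABCDEFGS     0298 MSD"))          # 0298 MSD
print(trailing_code("SDFKLJSDDSFWW         0298 RFD"))  # 0298 RFD
print(trailing_code("ABCDEFGS 12345 MSD"))             # None (5 digits, not 4)
```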
A: 

The simplest way for the kind of data you are presenting is to split the line into fields at the whitespace, then rejoin the fields that belong together. In VB.NET, Regex.Split(line, "\s+") returns an array of strings (no doubled backslash is needed, because VB string literals do not treat the backslash as an escape character). This is also more robust against variations in the field contents, for example if in the second case a line reads "00006011731 TAB 3FC 10MG 30UOU".
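
A minimal sketch of the split-and-rejoin idea for case 2, in Python rather than VB.NET. The rule for which fields belong together is an assumption read off the sample rows: the first field is column 1, the last two fields form column 3, and everything in between is column 2. The name `split_fields_case2` is made up for the example.

```python
import re

def split_fields_case2(line):
    """Split on whitespace, then rejoin into three columns.

    Assumption: the last two fields (strength and pack size in the sample
    data) always form column 3.
    """
    fields = re.split(r"\s+", line.strip())
    if len(fields) < 4:
        return None  # not enough fields for this layout
    return fields[0], " ".join(fields[1:-2]), " ".join(fields[-2:])

print(split_fields_case2("00006011754  TAB FC 10MG 90UOU"))
# ('00006011754', 'TAB FC', '10MG 90UOU')
print(split_fields_case2("00006011731 TAB 3FC 10MG 30UOU"))
# ('00006011731', 'TAB 3FC', '10MG 30UOU')
```

The second call is the line from this answer: a digit inside a middle field does not shift the column boundary, because the columns are chosen by field position rather than by the first digit.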

Svante