Hi, I'm trying to split a string into tokens (via regular expressions) in the following way:

Example #1
input string: 'hello'
first token: '
second token: hello
third token: '

Example #2
input string: 'hello world'
first token: '
second token: hello world
third token: '

Example #3
input string: hello world
first token: hello
second token: world

i.e., only split up the string if it is NOT in single quotation marks, and single quotes should be in their own token.

This is what I have so far:

string pattern = @"'|\s";
Regex RE = new Regex(pattern);
string[] tokens = RE.Split("'hello world'");

This will work for example #1 and example #3, but it will NOT work for example #2. I'm wondering whether there is, at least in theory, a way to achieve what I want with regular expressions.

+2  A: 

'[^']+' will match text inside single quotes. If you want it grouped, (')([^']+)('). If no matches are found, then just use a regular string split. I don't think it makes sense to try to do the whole thing in one regular expression.
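A minimal sketch of that two-step idea, assuming the whole input is either one quoted string or plain words. The class name QuotedOrSplit, the Tokenize helper, and the ^...$ anchors are illustrative assumptions, not part of the answer above:

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuotedOrSplit
{
    // Hypothetical helper: returns the three quote tokens if the whole
    // input is a single quoted string, otherwise splits on whitespace.
    public static string[] Tokenize(string input)
    {
        // The ^ and $ anchors are an added assumption: the entire input
        // must be one quoted string for the grouped pattern to apply.
        Match m = Regex.Match(input, @"^(')([^']+)(')$");
        if (m.Success)
        {
            return new[] { m.Groups[1].Value, m.Groups[2].Value, m.Groups[3].Value };
        }

        // No quoted string found: fall back to a regular string split.
        return input.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries);
    }
}
```

This only covers the simple inputs from the question; mixed text with quoted and unquoted parts needs one of the other approaches in this thread.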

EDIT: It seems from your comments on the question that you actually want this applied over a larger block of text rather than just simple inputs like you indicated. If that's the case, then I don't think a regular expression is your answer.

Daniel Straight
Right, you can't create a regular expression to parse through an undefined number of tokens (at least, not in a single step).
Scott Smith
+5  A: 

You could build a simple lexer, which would involve consuming each of the tokens one by one. So you would have a list of regular expressions and would attempt to match one of them at each point. That is the easiest and cleanest way to do this if your input is anything beyond the very simple.
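As a rough sketch of what such a lexer could look like: it tries a small set of regular expressions at the current position and tracks whether it is inside single quotes. The class name SimpleLexer, the rule names, and the specific patterns are assumptions made for illustration. The \G anchor ties each match to the start position passed to Match.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SimpleLexer
{
    // Hypothetical rule set; \G anchors each match at the start position.
    private static readonly Regex Quote = new Regex(@"\G'");
    private static readonly Regex QuotedText = new Regex(@"\G[^']+");
    private static readonly Regex Whitespace = new Regex(@"\G\s+");
    private static readonly Regex Word = new Regex(@"\G[^'\s]+");

    public static List<string> Tokenize(string input)
    {
        var tokens = new List<string>();
        int pos = 0;
        bool inQuote = false;

        while (pos < input.Length)
        {
            Match m;
            if ((m = Quote.Match(input, pos)).Success)
            {
                // A quote is always its own token and toggles quote mode.
                inQuote = !inQuote;
            }
            else if (inQuote)
            {
                // Inside quotes: consume everything up to the next quote.
                m = QuotedText.Match(input, pos);
            }
            else if ((m = Whitespace.Match(input, pos)).Success)
            {
                // Outside quotes: whitespace separates tokens and is dropped.
                pos += m.Length;
                continue;
            }
            else
            {
                // Outside quotes: a run of non-space, non-quote characters.
                m = Word.Match(input, pos);
            }

            tokens.Add(m.Value);
            pos += m.Length;
        }

        return tokens;
    }
}
```

Note that this sketch treats every single quote as a delimiter, so an apostrophe in a word such as we'll would switch it into quote mode; a real lexer would need a rule for that case.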

Stephen Cross
Yes, but I want to have "hello world" as a single token. I find Regex.Split() to be very good at generating tokens, except for this one case...
Shnitzel
@Shnitzel: Then you should define a case in your lexer to consume more text if it is inside single quotes. Yes, Regex.Split() is a very simple option, and from what you want to do it seems you might need something more powerful. Also, you might want to use one of the lexer and parser generators available for C#; they can make your life a lot easier.
Stephen Cross
+1 I think the OP is trying to drive screws with a hammer.
clintp
@clintp: Yes, it would seem that way :)
Stephen Cross
+3  A: 

Use a token parser to split the input into tokens. Use regex to find string patterns.

TFD
+1  A: 

You can first split on quoted string, and then further tokenize.

foreach (String s in Regex.Split(input, @"('[^']+')")) {
    // Check first if s is a quote.
    // If so, split out the quotes.
    // If not, do what you intend to do.
}

(Note: you need the capturing parentheses in the pattern to make sure Regex.Split returns the quoted strings too.)
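One way the loop body might be filled in, as a sketch only. The class name SplitThenTokenize and the choice to split unquoted pieces on spaces and tabs are assumptions, not something the answer specifies:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SplitThenTokenize
{
    public static List<string> Tokenize(string input)
    {
        var tokens = new List<string>();

        // The capturing group makes Regex.Split return the quoted pieces too.
        foreach (string piece in Regex.Split(input, @"('[^']+')"))
        {
            bool isQuoted = piece.Length >= 3
                && piece[0] == '\''
                && piece[piece.Length - 1] == '\'';

            if (isQuoted)
            {
                // Split out the quotes: ', inner text, '.
                tokens.Add("'");
                tokens.Add(piece.Substring(1, piece.Length - 2));
                tokens.Add("'");
            }
            else
            {
                // Assumed handling for unquoted text: split on spaces and tabs.
                tokens.AddRange(piece.Split(new[] { ' ', '\t' },
                    StringSplitOptions.RemoveEmptyEntries));
            }
        }

        return tokens;
    }
}
```

Regex.Split also returns empty strings around a match at the start or end of the input; RemoveEmptyEntries discards those here.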

Moron
Won't `Split` remove strings between quotes?
Kobi
I don't think so, but there are differences between .NET versions. I remember I used this idea to quickly write a lexer+parser which actually worked. It might not have been optimal, but it seemed good enough even for medium-sized strings.
Moron
Please check that, your code does remove tokens between quotes - `Split` does not include the separator in its results.
Kobi
Which version of .NET are you using? Anyway, I will check the code I have and modify it soon.
Moron
I'm using 3.5, but I'm pretty sure all flavors will agree here. Here's a JavaScript version: `alert("hello 'crule' world".split(/'[^']+'/));`
Kobi
I checked the code I had. The pattern was in brackets (capturing parentheses, according to MSDN). That works in .NET 2.0 or greater. I have edited the answer.
Moron
Interesting. Thanks, and +1.
Kobi
A: 

Try this Regular Expression:

([']*)([a-z]+)([']*)

This finds zero or more single quotes at the beginning and end of each word. Between them it matches one or more characters in the a-z set (unless you make it case-insensitive, it will only find lowercase characters). Group 1 holds the leading ' if there is one, group 2 holds the word, and group 3 holds the trailing ' if there is one. Anything that is not a character a-z ends the word, so each word gives a separate match.

Tim C
+1  A: 

Not exactly what you are trying to do, but regular expression conditions might help out as you look for a solution:

(?<quot>')?(?<words>(?(quot)[^']|\w)+)(?(quot)')

If an opening quote is found, it matches non-quote characters until the closing quote. Otherwise it matches word characters. Your results are in the groups named "quot" and "words".
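A short sketch of how the named groups could be read in C#, assuming .NET's conditional construct (?(name)yes|no) behaves as described above. The class name ConditionalDemo and the Words helper are made up for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ConditionalDemo
{
    // Same pattern as in the answer above.
    private static readonly Regex Pattern =
        new Regex(@"(?<quot>')?(?<words>(?(quot)[^']|\w)+)(?(quot)')");

    // Hypothetical helper: collects the "words" group of every match.
    public static List<string> Words(string input)
    {
        var result = new List<string>();
        foreach (Match m in Pattern.Matches(input))
        {
            // m.Groups["quot"].Success tells you whether this match was quoted.
            result.Add(m.Groups["words"].Value);
        }
        return result;
    }
}
```

To get the quotes as separate tokens, as the question asks, you would still need to emit a ' before and after each match whose "quot" group succeeded.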

Dave
+1 - I think this is what the OP is looking for. This is similar to my answer, but more complex (I think OR works better here). Also, you had 999 reputation.
Kobi
+1  A: 

While it would be possible to match ' and the text inside separately, and alternatively match the text alone, RegExp does not allow an indefinite number of matches. Or better said, you can only match those objects you explicitly state in the expression. So ((\w+)+\b) could theoretically match all words one by one. The outer group will correctly match the whole text, and the inner group will also match each word correctly, but you will only be able to reference the last match.

There is no way to match a group of matched matches (weird sentence). The only possible way would be to match the string and then split it into separate words.

poke
yep, that's what I was thinking... but let's see if someone comes up with anything anyway;)
Shnitzel
Nobody really gave me great answers... at least you were honest.
Shnitzel
Not true at all. By your logic, regular expressions cannot be used to match all numbers in a text, for example. But they can, quite easily, in all flavors. **You don't need a capturing group for every string.**
Kobi
You won't be able to get each number as a separate match with one regular expression, no.
poke
why would you use groups then?
Kobi
+1  A: 

You'll have a hard time using Split here, but you can use a MatchCollection to find all matches in your string:

string str = "hello world, 'HELLO WORLD': we'll be fine.";
MatchCollection matches = Regex.Matches(str, @"(')([^']+)(')|(\w+)");

The regex searches for a string between single quotes. If it cannot find one, it takes a single word.
Now it gets a little tricky: .NET returns a collection of Match objects. Each Match has several Groups. The first Group holds the whole match ('hello world'), while the rest hold the sub-matches (', hello world, '). You also get many empty, unsuccessful Groups.
You can still iterate easily and get your tokens. Here's an example using LINQ:

var tokens = from match in matches.Cast<Match>()
             from g in match.Groups.Cast<Group>().Skip(1)
             where g.Success
             select g.Value;

tokens is now a collection of strings:
hello, world, ', HELLO WORLD, ', we, ll, be, fine
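For reference, the two snippets above can be combined into one self-contained helper. This is a sketch: the class name MatchTokenizer and the optional wordPattern parameter are additions for illustration, not part of the original code.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class MatchTokenizer
{
    // wordPattern is a hypothetical parameter for the unquoted alternative.
    public static string[] Tokenize(string input, string wordPattern = @"\w+")
    {
        // Quoted string (three groups) OR a single word (fourth group).
        string pattern = @"(')([^']+)(')|(" + wordPattern + ")";

        return Regex.Matches(input, pattern)
            .Cast<Match>()
            // Skip group 0 (the whole match) and keep only the sub-groups.
            .SelectMany(m => m.Groups.Cast<Group>().Skip(1))
            // Groups from the alternative that did not match are unsuccessful.
            .Where(g => g.Success)
            .Select(g => g.Value)
            .ToArray();
    }
}
```

Calling MatchTokenizer.Tokenize on the three inputs from the question gives the tokens the question asks for: ', hello, ' for 'hello'; ', hello world, ' for 'hello world'; and hello, world for hello world.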

Kobi
Minor note: you can replace `\w+` with `\S+` to keep other characters.
Kobi