I have written the following regex and would like empty strings to be removed from the result automatically. I could not find a Regex equivalent of RemoveEmptyEntries, which I found only for the Split method of string.

    string test = "{ key1 = { key2= xx } | key3 = y | key4 = z }";
    string[] help = Regex.Split(test, "(=)|({)|(})|(\\|)");

The result string array contains elements which are empty. I would like to run the regular expression without yielding any empty strings contained in the result.

I will run this code very, very frequently, so I need it to be as efficient as possible.

Update: As this is a parser, I need to keep the tokens, and Regex is the only way I have found to keep them.

A: 

I didn't quite figure out your question. But "^$" would mean a line that ends next to where it starts, therefore an empty line. Does that help?

Fernando
I think he's looking for the regex equivalent to StringSplitOptions.RemoveEmptyEntries.
Joel Coehoorn
This is what I tried to say in my first line - sorry I am not a native English speaker.
weismat
No problem, neither am I.
Fernando
+3  A: 
Joel Coehoorn
OK, but this means that I add an additional LINQ query when doing the parsing. Not really nice, but at least I have not missed something trivial in the documentation.
weismat
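For readers following the thread, a LINQ filter of the kind this comment refers to could look like the sketch below. It is an illustration only, not necessarily the code from the answer; the class and method names (SplitFilterSketch, SplitNonEmpty) are hypothetical.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class SplitFilterSketch
{
    // Hypothetical helper: split with the question's pattern, then drop
    // zero-length entries (a Regex counterpart of RemoveEmptyEntries).
    public static string[] SplitNonEmpty(string input)
    {
        return Regex.Split(input, "(=)|({)|(})|(\\|)")
                    .Where(s => s.Length > 0)
                    .ToArray();
    }

    public static void Main()
    {
        string test = "{ key1 = { key2= xx } | key3 = y | key4 = z }";
        foreach (string token in SplitNonEmpty(test))
        {
            Console.WriteLine("[" + token + "]");
        }
    }
}
```

Entries that consist only of white space survive this filter, because they are not zero-length. Replacing the predicate with `s => !string.IsNullOrWhiteSpace(s)` would drop those as well.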
Isn't "(=)|({)|(})|(\|)" a very complicated way of saying "[={}|]" in the first place? Or am I missing something fundamental here? The character class expression should be a lot faster.
Tomalak
And *if* the character class expression is what the OP is after, then there is String.Split(Char[]) to avoid regex altogether.
Tomalak
Put that in an answer and I'll upvote.
Joel Coehoorn
I need to keep the tokens for the parsing, so I need the Regex. Or is there a way to keep the tokens with the String class?
weismat
@Joel Coehoorn: Done.
Tomalak
@weismat: Please see my answer.
Tomalak
A: 

As for efficiency: If the stars are lucky, you can gain some performance by compiling the regex:

Regex r = new Regex ("<regex goes here>", RegexOptions.Compiled);
phresnel
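Since the code is meant to run very frequently, the compiled Regex is best built once and reused. The following is a minimal sketch under that assumption; the class, field, and method names are hypothetical, and the pattern is the one from the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CompiledRegexSketch
{
    // Constructed once per process and reused for every call.
    // RegexOptions.Compiled costs more at construction time, so it only
    // pays off when the same instance is used many times.
    private static readonly Regex Delimiters =
        new Regex("(=)|({)|(})|(\\|)", RegexOptions.Compiled);

    public static string[] Split(string input)
    {
        return Delimiters.Split(input);
    }

    public static void Main()
    {
        string test = "{ key1 = { key2= xx } | key3 = y | key4 = z }";
        Console.WriteLine(Split(test).Length);
    }
}
```

Whether compilation actually helps depends on the pattern and the workload, so it is worth measuring both variants.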
+1  A: 

To remove spaces from a string just do this

Regex exp = new Regex(@"\s+");
string test = "{ key1 = { key2= xx } | key3 = y | key4 = z }";
string result = exp.Replace(test, string.Empty);

Or you could also do the following (did not test which one works faster)

Regex.Replace(test, " ", string.Empty, RegexOptions.Compiled)

Here is what Jeff Atwood (incidentally one of the creators of Stack Overflow) has to say about compiled regex

After this you can use your split code to put the keys into the string array.

Binoj Antony
The OP mentioned this is something that will happen over and over and over in quick succession, therefore compiled is the correct choice.
Joel Coehoorn
Better to show the door, than to carry them through the door!
Binoj Antony
I will change it to compiled, but my issue is the empty string elements in the result, not removing white space beforehand...
weismat
The empty elements are there regardless of whether white space is removed beforehand or not...
weismat
+2  A: 

Maybe not a full solution to the question, but I have a few remarks for the problem at hand (tokenizing a string):

the original regex:    (=)|({)|(})|(\|)
is equivalent to:      (=|{|}|\|)
is equivalent to:      ([={}|])

All of the above expressions return the same 21 elements, but they perform differently. I set up a quick test going over 100,000 iterations of Split() operations using pre-built Regex objects with RegexOptions.Compiled and the Stopwatch class.

  • regex #1 takes 2002ms on my hardware
  • regex #2 takes 1691ms
  • regex #3 takes 1542ms
  • regex #4 takes 1839ms (that's the one below)

YMMV.
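A timing harness along the lines described above might look like this. It is a rough sketch, not the exact test that produced the numbers listed; the class and method names are hypothetical, and absolute timings will differ by machine and runtime.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

public static class SplitBenchmarkSketch
{
    // Times `iterations` Split() calls on one pre-built, compiled Regex.
    public static long MeasureMilliseconds(string pattern, string input, int iterations)
    {
        Regex regex = new Regex(pattern, RegexOptions.Compiled);
        regex.Split(input); // warm-up call, excluded from the timing

        Stopwatch watch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            regex.Split(input);
        }
        watch.Stop();
        return watch.ElapsedMilliseconds;
    }

    public static void Main()
    {
        string test = "{ key1 = { key2= xx } | key3 = y | key4 = z }";
        string[] patterns =
        {
            @"(=)|({)|(})|(\|)",   // regex #1
            @"(=|{|}|\|)",         // regex #2
            @"([={}|])",           // regex #3
            @"\s*([={}|])\s*"      // regex #4
        };

        foreach (string pattern in patterns)
        {
            long elapsed = MeasureMilliseconds(pattern, test, 100000);
            Console.WriteLine("{0,-20} {1} ms", pattern, elapsed);
        }
    }
}
```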

However, the desired elements can still be surrounded by white space. I figure this is undesired as well, so the regex I would split on would be this:

\s*([={}|])\s*

The returned elements are:

["", "{", "key1", "=", "", "{", "key2", "=", "xx", "}", "", "|", "key3", "=", "y", "|", "key4", "=", "z", "}", ""]

The few remaining empty strings should not pose a big problem performance-wise when iterating the array and can be taken care of (read: ignored) when they are encountered.

EDIT: If you measure performance it is possible that you find splitting on ([={}|]) and trimming the array elements "manually" is faster than splitting on \s*([={}|])\s*. Just try what works better for you.
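The "split, then trim manually" variant could be sketched as follows. This is an illustration with hypothetical names (TrimmingTokenizerSketch, Tokenize), not measured code; it also skips pieces that are empty after trimming, so the result holds only tokens.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TrimmingTokenizerSketch
{
    private static readonly Regex Delimiters =
        new Regex(@"([={}|])", RegexOptions.Compiled);

    // Splits on the delimiters (kept via the capture group), trims each
    // piece, and skips pieces that are empty after trimming.
    public static List<string> Tokenize(string input)
    {
        List<string> tokens = new List<string>();
        foreach (string piece in Delimiters.Split(input))
        {
            string trimmed = piece.Trim();
            if (trimmed.Length > 0)
            {
                tokens.Add(trimmed);
            }
        }
        return tokens;
    }

    public static void Main()
    {
        string test = "{ key1 = { key2= xx } | key3 = y | key4 = z }";
        Console.WriteLine(string.Join(" ", Tokenize(test).ToArray()));
    }
}
```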

Tomalak
promised upvote completed
Joel Coehoorn
As an additional note: considering there are still 4 empty strings after the split, my .Where() code is probably the best way to filter them out.
Joel Coehoorn
I also get 21 elements with the first regular expression, but I gain 4 ms on 1000 iterations when measuring the performance with Stopwatch. I did not understand your last suggestion though. Can you give me the exact code? I get a bad escape sequence error.
weismat
Regular expressions and C# strings are two different things. You must escape backslashes in C# strings. I get back 51 elements when splitting on your original expression. I tried it with JavaScript regex though. Maybe there is an implementation difference between C# and JS in this regard?
Tomalak
I just tried it with C# and sure enough the Regex #1 returns 21 elements as well. I implemented a quick test in C#, I posted the results above.
Tomalak
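On the escape-sequence error mentioned in the comments above: the regex \s*([={}|])\s* can be written in C# either with doubled backslashes or as a verbatim string literal. A small sketch (the class and constant names are hypothetical):

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapingSketch
{
    // Regular string literal: each regex backslash is doubled for the C# compiler.
    public const string Escaped = "\\s*([={}|])\\s*";

    // Verbatim string literal: backslashes are passed through unchanged.
    public const string Verbatim = @"\s*([={}|])\s*";

    public static void Main()
    {
        // Both literals produce the same pattern text.
        Console.WriteLine(Escaped == Verbatim);

        string test = "{ key1 = { key2= xx } | key3 = y | key4 = z }";
        string[] parts = Regex.Split(test, Verbatim);
        Console.WriteLine(parts.Length);
    }
}
```

Writing "\s" in a regular (non-verbatim) literal does not compile, because \s is not a C# escape sequence.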
The regular expressions are not equivalent if you want to use matches of the groups.
Gumbo
@Gumbo: They are used in Split() context. There is no using of the match groups.
Tomalak
A: 

Rather than splitting the string using the regex you could modify your regular expression and return a match collection. Something like this:

string test = "{ key1 = { key2= xx } | key3 = y | key4 = z }";

Regex regex = new Regex("[={}|]|[^\\s={}|]{1,}");
MatchCollection matches = regex.Matches(test);

string[] help = new string[matches.Count];

for (int index = 0; index < matches.Count; index++)
{
    help[index] = matches[index].Value;                
}

This will return the same as your regular expression minus the empty (white space) elements in the final array.

lexx
Thanks for the comment - I will try this possibility as well when benchmarking my parser with or without the empty string.
weismat
A: 

So you want multiple occurrences of delimiters between values to be matched only once.

\s*[{}=|][\s{}=|]*

This should match, in this order, any amount of whitespace, one delimiter, and any amount of both whitespace and further delimiters.

Adding C# string escapes and a compilation declaration:

Regex regex = new Regex("\\s*[{}=|][\\s{}=|]*", RegexOptions.Compiled);
Svante
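To see what Split() returns with this pattern, a small sketch follows (the class and method names are hypothetical). Note the assumption being illustrated: the pattern has no capture group, so the delimiters themselves are not part of the result.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DelimiterRunSketch
{
    private static readonly Regex Runs =
        new Regex(@"\s*[{}=|][\s{}=|]*", RegexOptions.Compiled);

    // No capture group, so Split() returns only the text between delimiter runs.
    public static string[] Values(string input)
    {
        return Runs.Split(input);
    }

    public static void Main()
    {
        string test = "{ key1 = { key2= xx } | key3 = y | key4 = z }";
        foreach (string value in Values(test))
        {
            Console.WriteLine("[" + value + "]");
        }
    }
}
```

For the sample input this yields the keys and values only, plus an empty string at the start and at the end because the input begins and ends with a delimiter. If the delimiter tokens are needed for parsing, as the question's update states, a capturing pattern is still required.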