ansaurus

Question

Answer 1

A:

I didn't quite understand your question, but "^$" matches a line that ends right where it starts, i.e. an empty line. Does that help?

Fernando 2009-03-24 15:48:15

I think he's looking for the regex equivalent to StringSplitOptions.RemoveEmptyEntries.

Joel Coehoorn 2009-03-24 15:58:55

This is what I tried to say in my first line - sorry I am not a native English speaker.

weismat 2009-03-24 16:38:07

No problem, neither am I.

Fernando 2009-03-24 17:13:26

Answer 2

+3 A:

Joel Coehoorn 2009-03-24 15:57:35

OK, but this means that I add an additional LINQ query when doing the parsing. Not really nice, but at least I have not missed something trivial in the documentation.

weismat 2009-03-24 16:12:54

Isn't "(=)|({)|(})|(\|)" a very complicated way of saying "[={}|]" in the first place? Or am I missing something fundamental here? The character class expression should be a lot faster.

Tomalak 2009-03-24 17:35:12

And *if* the character class expression is what the OP is after, then there is String.Split(Char[]) to avoid regex altogether.

Tomalak 2009-03-24 17:36:59
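
A minimal sketch of the String.Split(Char[]) route mentioned above, assuming the delimiter characters themselves are not needed afterwards. The class and method names are hypothetical and do not come from the thread. Note that RemoveEmptyEntries only drops zero-length pieces, so whitespace-only pieces between two delimiters still come back and need trimming or skipping.

```csharp
using System;

public static class SplitSketch
{
    // Hypothetical helper: splits on the four delimiter characters without a regex.
    public static string[] SplitOnDelimiters(string input)
    {
        char[] delimiters = { '=', '{', '}', '|' };
        // RemoveEmptyEntries drops zero-length pieces; the delimiters are not returned.
        return input.Split(delimiters, StringSplitOptions.RemoveEmptyEntries);
    }

    public static void Main()
    {
        string test = "{ key1 = { key2= xx } | key3 = y | key4 = z }";
        foreach (string piece in SplitOnDelimiters(test))
        {
            // Brackets make any remaining whitespace-only pieces visible.
            Console.WriteLine("[" + piece + "]");
        }
    }
}
```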

Put that in an answer and I'll upvote.

Joel Coehoorn 2009-03-24 18:14:46

I need to keep the tokens for the parsing, so I need the Regex. Or is there a way to keep the tokens with the String class?

weismat 2009-03-24 18:51:03

@Joel Coehoorn: Done.

Tomalak 2009-03-24 19:32:55

@weismat: Please see my answer.

Tomalak 2009-03-24 19:33:41

Answer 3

A:

As for efficiency: with some luck, you can gain some performance by compiling the regex:

Regex r = new Regex ("<regex goes here>", RegexOptions.Compiled);

phresnel 2009-03-24 16:02:36
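
Compilation has a one-time cost, so it tends to pay off only when the same Regex instance is reused across many calls. A minimal sketch of that pattern follows; the class name Tokenizer is hypothetical, and the pattern is borrowed from Answer 5 purely for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class Tokenizer
{
    // Built once and reused, so the compilation cost is paid a single time.
    private static readonly Regex Delimiters =
        new Regex(@"([={}|])", RegexOptions.Compiled);

    // Splits the input and keeps the delimiters, because the pattern captures them.
    public static string[] Split(string input)
    {
        return Delimiters.Split(input);
    }
}
```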

Answer 4

+1 A:

To remove spaces from a string just do this

Regex exp = new Regex(@"\s+");
string test = "{ key1 = { key2= xx } | key3 = y | key4 = z }";
string result = exp.Replace(test, string.Empty);

Or you could also do the following (did not test which one works faster)

result = Regex.Replace(test, " ", string.Empty, RegexOptions.Compiled);

Here is what Jeff Atwood (incidentally one of the creators of Stack Overflow) has to say about compiled regex.

After this you can use your split code to put the keys into the string array.

Binoj Antony 2009-03-24 16:28:11

The OP mentioned this is something that will happen over and over in quick succession, therefore compiled is the correct choice.

Joel Coehoorn 2009-03-24 16:35:07

Better to show the door, than to carry them through the door!

Binoj Antony 2009-03-24 16:44:22

I will change it to compiled, but my issue is the empty string elements in the result, not removing white space beforehand.

weismat 2009-03-24 16:45:05

The empty elements are there regardless of whether white space is removed beforehand.

weismat 2009-03-24 16:46:32

Answer 5

+2 A:

Maybe not a full solution to the question, but I have a few remarks for the problem at hand (tokenizing a string):

the original regex:    (=)|({)|(})|(\|)
is equivalent to:      (=|{|}|\|)
is equivalent to:      ([={}|])

All of the above expressions return the same 21 elements, but they perform differently. I set up a quick test going over 100,000 iterations of Split() operations using pre-built Regex objects with RegexOptions.Compiled and the Stopwatch class.

regex #1 takes 2002ms on my hardware
regex #2 takes 1691ms
regex #3 takes 1542ms
regex #4 takes 1839ms (that's the one below)

YMMV.
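
A rough sketch of how such a timing test could be set up, assuming one pre-built compiled Regex per pattern and a warm-up call before the timed loop. This is not the code used for the numbers above; the names SplitBenchmark and TimeSplit are hypothetical, and the timings will differ by machine and runtime.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

public static class SplitBenchmark
{
    // Times `iterations` Split() calls on one pre-built, compiled Regex.
    public static long TimeSplit(string pattern, string input, int iterations)
    {
        Regex regex = new Regex(pattern, RegexOptions.Compiled);
        regex.Split(input); // warm-up call, kept outside the timed loop

        Stopwatch watch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            regex.Split(input);
        }
        watch.Stop();
        return watch.ElapsedMilliseconds;
    }

    public static void Main()
    {
        string test = "{ key1 = { key2= xx } | key3 = y | key4 = z }";
        string[] patterns =
        {
            @"(=)|({)|(})|(\|)",
            @"(=|{|}|\|)",
            @"([={}|])",
            @"\s*([={}|])\s*"
        };
        foreach (string pattern in patterns)
        {
            Console.WriteLine("{0}: {1} ms", pattern, TimeSplit(pattern, test, 100000));
        }
    }
}
```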

However, the desired elements can still be surrounded by white space. I figure this is undesired as well, so the regex I would split on would be this:

\s*([={}|])\s*

The returned elements are:

["", "{", "key1", "=", "", "{", "key2", "=", "xx", "}", "", "|", "key3", "=", "y", "|", "key4", "=", "z", "}", ""]

The few remaining empty strings should not pose a big problem performance-wise when iterating the array and can be taken care of (read: ignored) when they are encountered.

EDIT: If you measure performance it is possible that you find splitting on ([={}|]) and trimming the array elements "manually" is faster than splitting on \s*([={}|])\s*. Just try what works better for you.

Tomalak 2009-03-24 19:30:34
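
The alternative from the EDIT above (split on ([={}|]), then trim "manually") could look roughly like this. It is a sketch under the assumption that whitespace-only pieces should simply be dropped; TrimmingTokenizer and Tokenize are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TrimmingTokenizer
{
    private static readonly Regex Delimiters =
        new Regex(@"([={}|])", RegexOptions.Compiled);

    // Splits on the delimiters (kept, because they are captured),
    // trims each piece and skips pieces that end up empty.
    public static List<string> Tokenize(string input)
    {
        List<string> tokens = new List<string>();
        foreach (string piece in Delimiters.Split(input))
        {
            string trimmed = piece.Trim();
            if (trimmed.Length > 0)
            {
                tokens.Add(trimmed);
            }
        }
        return tokens;
    }
}
```

For the sample string this leaves only the delimiters and the key/value tokens, with no empty entries to special-case later.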

promised upvote completed

Joel Coehoorn 2009-03-24 20:04:11

As an additional note: considering there are still 4 empty strings after the split, my .Where() code is probably the best way to filter them out.

Joel Coehoorn 2009-03-24 20:09:07
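
The .Where() code referred to here is not shown in this copy of the thread. A minimal sketch of what such a filter might look like, assuming LINQ is available and using hypothetical names (FilteredSplit, Tokenize):

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class FilteredSplit
{
    private static readonly Regex Delimiters =
        new Regex(@"\s*([={}|])\s*", RegexOptions.Compiled);

    // Splits on the whitespace-tolerant pattern, then drops the empty strings.
    public static string[] Tokenize(string input)
    {
        return Delimiters.Split(input)
                         .Where(token => token.Length > 0)
                         .ToArray();
    }
}
```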

I also get 21 elements with the first regular expression, but I gain 4 ms on 1000 iterations when measuring the performance with Stopwatch. I did not understand your last suggestion though. Can you give me the exact code? I get a bad escape sequence error.

weismat 2009-03-25 04:56:18

Regular expressions and C# strings are two different things. You must escape backslashes in C# strings. I get back 51 elements when splitting on your original expression. I tried it with JavaScript regex though. Maybe there is an implementation difference between C# and JS in this regard?

Tomalak 2009-03-25 07:51:17
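
To illustrate the escaping point: in a regular C# string literal every regex backslash has to be doubled, while a verbatim literal (prefixed with @) takes backslashes as they are. A short sketch, with the hypothetical class name EscapeSketch:

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapeSketch
{
    // Regular string literal: each regex backslash is doubled.
    public const string Escaped = "\\s*([={}|])\\s*";

    // Verbatim string literal: backslashes are taken literally.
    public const string Verbatim = @"\s*([={}|])\s*";

    public static void Main()
    {
        // Both literals produce the same pattern text.
        Console.WriteLine(Escaped == Verbatim); // prints True

        Regex regex = new Regex(Verbatim, RegexOptions.Compiled);
        Console.WriteLine(regex.Split("key1 = xx").Length); // prints 3
    }
}
```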

I just tried it with C# and sure enough, regex #1 returns 21 elements as well. I implemented a quick test in C# and posted the results above.

Tomalak 2009-03-25 08:41:15

The regular expressions are not equivalent if you want to use matches of the groups.

Gumbo 2009-03-25 15:38:15

@Gumbo: They are used in a Split() context. The match groups are not used.

Tomalak 2009-03-25 18:14:20

Answer 6

A:

Rather than splitting the string using the regex you could modify your regular expression and return a match collection. Something like this:

string test = "{ key1 = { key2= xx } | key3 = y | key4 = z }";

Regex regex = new Regex("[={}|]|[^\\s={}|]{1,}");
MatchCollection matches = regex.Matches(test);

string[] help = new string[matches.Count];

for (int index = 0; index < matches.Count; index++)
{
    help[index] = matches[index].Value;                
}

This will return the same as your regular expression minus the empty (white space) elements in the final array.

lexx 2009-03-25 13:42:29

Thanks for the comment - I will try this possibility as well when benchmarking my parser with or without the empty string.

weismat 2009-03-25 14:07:25

Answer 7

A:

So you want multiple occurrences of delimiters between values to be matched only once.

\s*[{}=|][\s{}=|]*

This should match, in this order, any amount of whitespace, one delimiter, and any amount of both whitespace and further delimiters.

Adding C# string escapes and a compilation declaration:

Regex regex = new Regex("\\s*[{}=|][\\s{}=|]*", RegexOptions.Compiled);

Svante 2009-03-25 15:26:53

Regex - Removing empty strings