I asked this question a long time ago; I wish I had read the answers to "When not to use Regex in C# (or Java, C++ etc)" first!

I wish to use Regex (regular expressions) to get a list of all strings in my C# source code, including strings that have double quotes embedded in them.

This should not be hard; however, before I spend time building the regular expression, has anyone got a "pre-canned" one already?

This is not as easy as it seems at first, due to cases such as:

  • "av\"d"
  • @"ab""cd"
  • @"ab"""
  • @"""ab"
  • etc.
+3  A: 

The regular expression for finding C-style strings is:

"(?:[^"\\]+|\\.)*"

This will not take comments into consideration, so your best bet would be to remove all comments first, using the following regular expression:

/\*(?s:(?!\*/).)*\*/|//.*

Note that if you put the above regular expressions in a string you'll need to double all backslashes and escape any quotation marks.

Update: Changed regular expression for comments to use DOTALL flag for multi-line comments.
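
As a rough illustration of the two steps above (strip comments, then match C-style strings), here is a minimal C# sketch. The class and method names (`RegexStringFinder`, `StripComments`, `FindStrings`) are made up for the example. The patterns are written as verbatim C# strings, so only the quotes are doubled and the backslashes stay as they are. One limitation worth knowing: the comment pattern is unaware of strings, so a `//` inside a literal such as a URL would be stripped as well.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class RegexStringFinder
{
    // Verbatim C# strings: each quote is doubled, backslashes are left alone.
    static readonly Regex Comments = new Regex(@"/\*(?s:(?!\*/).)*\*/|//.*");
    static readonly Regex Strings = new Regex(@"""(?:[^""\\]+|\\.)*""");

    // Removes /* ... */ and // ... comments (assumption: no "//" inside strings).
    public static string StripComments(string source)
    {
        return Comments.Replace(source, "");
    }

    // Returns each C-style string literal, quotes included.
    public static List<string> FindStrings(string source)
    {
        var found = new List<string>();
        foreach (Match m in Strings.Matches(StripComments(source)))
            found.Add(m.Value);
        return found;
    }

    static void Main()
    {
        string source =
            "/* a \"quoted\" word\n   in a comment */\n" +
            "string a = \"av\\\"d\"; // \"ignored\"\n" +
            "string b = \"plain\";";

        // Prints:
        // "av\"d"
        // "plain"
        foreach (string s in FindStrings(source))
            Console.WriteLine(s);
    }
}
```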

Also, you may want to support verbatim string literals (@"..."), so use this instead of the other string regex:

@"(?:[^"]+|"")*"|"(?:[^"\\]+|\\.)*"

And a reminder: Don't use DOTALL as a global flag for any of these regular expressions, as it would break the single-line comments and single-line strings (normal strings are single-line, while verbatim strings can span multiple lines).
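
To see how the combined pattern behaves on the awkward cases from the question, here is a small C# sketch. The names `AllStringLiterals` and `Find` are hypothetical, and `Q` is used as a stand-in for the double quote purely to keep the C# source readable.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class AllStringLiterals
{
    // Q stands in for a double quote and is swapped in before compiling the regex.
    static readonly Regex Pattern = new Regex(
        @"@Q(?:[^Q]+|QQ)*Q|Q(?:[^Q\\]+|\\.)*Q".Replace('Q', '"'));

    // Returns every regular or verbatim string literal, quotes (and @) included.
    public static List<string> Find(string source)
    {
        var found = new List<string>();
        foreach (Match m in Pattern.Matches(source))
            found.Add(m.Value);
        return found;
    }

    static void Main()
    {
        // The four cases listed in the question, again with Q for the quote.
        string source =
            @"a = Qav\QdQ; b = @QabQQcdQ; c = @QabQQQ; d = @QQQabQ;".Replace('Q', '"');

        // Prints:
        // "av\"d"
        // @"ab""cd"
        // @"ab"""
        // @"""ab"
        foreach (string s in Find(source))
            Console.WriteLine(s);
    }
}
```

Comments are not handled here, so this still assumes they were removed beforehand.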

Blixt
This regular expression doesn't take @"" type string literals into consideration though.
DrJokepu
Just updated it for that too =)
Blixt
@"(?:[^"]+|"")*"|"(?:[^"\\]+|\\.)*" is not a valid C# string, and as it does not have an @ within the regex, I don't see how it takes @"" type string literals into consideration.
Ian Ringrose
Ah but you see, it's not a C# string, it's just the regular expression, as stated above. As a C# string it would be "@\"(?:[^\"]+|\"\")*\"|\"(?:[^\"\\\\]+|\\\\.)*\""
Blixt
A: 

Via www.regular-expressions.info:

"[^"\\\r\n]*(?:\\.[^"\\\r\n]*)*" matches a single-line string in which the quote character can appear if it is escaped by a backslash. Though this regular expression may seem more complicated than it needs to be, it is much faster than simpler solutions which can cause a whole lot of backtracking in case a double quote appears somewhere all by itself rather than part of a string. "[^"\\]*(?:\\.[^"\\]*)*" allows the string to span multiple lines.

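A minimal C# sketch of the difference between the two quoted patterns, with made-up names (`UnrolledStrings`, `SingleLine`, `MultiLine`, `Count`): the first refuses a string that contains a line break, the second accepts it. It only demonstrates what each variant matches and makes no attempt to measure the speed claim.

```csharp
using System;
using System.Text.RegularExpressions;

static class UnrolledStrings
{
    // Verbatim C# strings: each quote is doubled, backslashes are left alone.
    public static readonly Regex SingleLine =
        new Regex(@"""[^""\\\r\n]*(?:\\.[^""\\\r\n]*)*""");
    public static readonly Regex MultiLine =
        new Regex(@"""[^""\\]*(?:\\.[^""\\]*)*""");

    // Counts how many string literals the given pattern finds.
    public static int Count(Regex pattern, string source)
    {
        return pattern.Matches(source).Count;
    }

    static void Main()
    {
        // Second literal contains a real line break between c and d.
        string source = "x = \"a\\\"b\"; y = \"c\nd\";";

        Console.WriteLine(Count(SingleLine, source)); // prints 1
        Console.WriteLine(Count(MultiLine, source));  // prints 2
    }
}
```
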
Lawrence Johnston
This regular expression doesn't take @"" type string literals into consideration though.
Ian Ringrose
+5  A: 

I am posting this as my answer so it stands out to others reading the question.

As has been pointed out in the helpful comments to my question, it is clear that regex is not a good tool for finding strings in C# code. I could have written a simple “parser” in the time I spent reminding myself of the regex syntax. (“Parser” is an overstatement, as there are no “ characters in comments etc.; it is my own source code I am dealing with.)

This seems to sum it up well:

Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems.

However, until it breaks on my code I will use the regular expression Blixt has posted, but if it gives me problems I will not spend much time trying to fix it before writing my own parser. E.g., as a C# string it is:

@"@Q(?:[^Q]+|QQ)*Q|Q(?:[^Q\\]+|\\.)*Q".Replace('Q', '\"')

Update: the above regex had problems, so I just wrote my own parser. Including the unit tests, it took about 2 hours to write. That's a lot less time than I spent just trying to find (and test) a pre-canned regex on the web.
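
The parser itself was not posted. Purely as an illustration of the idea, a hand-written scanner for this job might look roughly like the C# sketch below. `StringScanner` and `FindStrings` are hypothetical names, and the sketch assumes no interpolated strings, raw string literals, or preprocessor directives in the input; unterminated literals are returned as far as they go.

```csharp
using System;
using System.Collections.Generic;

static class StringScanner
{
    // Returns every string literal (quotes included) found in C# source text.
    public static List<string> FindStrings(string src)
    {
        var found = new List<string>();
        int i = 0;
        while (i < src.Length)
        {
            char c = src[i];
            char next = i + 1 < src.Length ? src[i + 1] : '\0';

            if (c == '/' && next == '/')        // line comment: skip to end of line
            {
                while (i < src.Length && src[i] != '\n') i++;
            }
            else if (c == '/' && next == '*')   // block comment: skip past closing */
            {
                int end = src.IndexOf("*/", i + 2, StringComparison.Ordinal);
                i = end < 0 ? src.Length : end + 2;
            }
            else if (c == '@' && next == '"')   // verbatim string: "" is an escaped quote
            {
                int j = i + 2;
                while (j < src.Length)
                {
                    if (src[j] != '"') j++;
                    else if (j + 1 < src.Length && src[j + 1] == '"') j += 2;
                    else break;
                }
                int stop = Math.Min(j + 1, src.Length);
                found.Add(src.Substring(i, stop - i));
                i = stop;
            }
            else if (c == '"' || c == '\'')     // regular string or char literal
            {
                int j = i + 1;
                while (j < src.Length && src[j] != c)
                    j += src[j] == '\\' ? 2 : 1;   // a backslash escapes the next char
                int stop = Math.Min(j + 1, src.Length);
                if (c == '"') found.Add(src.Substring(i, stop - i));
                i = stop;                          // char literals are skipped, not reported
            }
            else
            {
                i++;
            }
        }
        return found;
    }

    static void Main()
    {
        // Q stands in for a double quote to keep the sample readable.
        string source = (@"// QskipQ" + "\n" +
            @"var u = Qhttp://xQ; var v = @QabQQcdQ; /* QnoQ */ var w = Qav\QdQ; char q = 'Q';")
            .Replace('Q', '"');

        // Prints:
        // "http://x"
        // @"ab""cd"
        // "av\"d"
        foreach (string s in FindStrings(source))
            Console.WriteLine(s);
    }
}
```

Unlike the strip-comments-first regex approach, a single left-to-right pass like this keeps a `//` inside a string literal intact, because comments are only recognised outside strings.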

The problem I seem to have is that I tend to avoid Regex and just write the string handling code myself, and then a lot of people claim I am wasting the client’s money by not using Regex. However, whenever I try to use Regex, what seems like a simple match pattern quickly becomes much harder. (None of the online articles on using Regex in .NET that I have read offers good guidance that makes it clear when NOT to use Regex. The same goes for the MSDN documentation.)

Let's see if we can help solve this problem: I have just created a Stack Overflow question, “When not to use Regex”.

Ian Ringrose