Hello!

I want to parse some C source files and find all string literals (such as "foo").

Something like this works:

String line = "myfunc(\"foo foo foo\", \"bar\");";
System.out.println(line);
String patternStr = "\\\"([^\"]+)\\\"";
Pattern pattern = Pattern.compile(patternStr);
Matcher matcher = pattern.matcher("");
String s;
if(line.matches(".*"+patternStr+".*"))
matcher.reset(line);
while(matcher.find()) {
    System.out.println(" FOUND "+matcher.groupCount()+" groups");
    System.out.println(matcher.group(1));
}

This works until a string contains escaped quotes, like:

String line = "myfunc(\"foo \\\"foo\\\" foo\", \"bar\");";

I don't know how to write an expression in Java like "without \" but with \.". I've found something similar for C here: http://wordaligned.org/articles/string-literals-and-regular-expressions

Thanks in advance.

A: 

Try the following:

String patternStr = "\"(([^\"\\\\]|\\\\.)*)\"";

(All I did was convert to Java the regexp from the article you mentioned: /"([^"\\]|\\.)*"/).

Eli Acherkan
It works, but could you please explain how it works? Why are there four backslashes before the closing bracket ("]")?
skyman
I didn't attempt to understand fully how exactly it works - I just translated the regexp from the article to Java. To translate it, I needed to escape quotes and backslashes; therefore each " from the article turned into \" in Java, and each \ turned into \\. That's why the 2 backslashes before the `]` turned into 4.
Eli Acherkan
I didn't even try to do that, because this regex seemed so strange that I thought it shouldn't work in Java ;] If anyone knows what's going on, please tell me.
skyman
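For anyone else wondering what is going on in that pattern, here is a rough sketch that builds it piece by piece. The class name `StringLiteralPattern` and the helper `findLiterals` are made up for illustration; only the pattern itself comes from the answer above. The idea is that the Java compiler first turns `\"` into `"` and `\\` into `\`, and only then does the regex engine see the result, where `\\` means one literal backslash.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StringLiteralPattern {
    // Java source      regex seen by the engine   meaning
    // "\""             "                          opening quote
    // "[^\"\\\\]"      [^"\\]                     any char except a quote or a backslash
    // "\\\\."          \\.                        a backslash followed by any one char
    // "\""             "                          closing quote
    static final String PATTERN_STR =
            "\"" + "((" + "[^\"\\\\]" + "|" + "\\\\." + ")*)" + "\"";

    static final Pattern PATTERN = Pattern.compile(PATTERN_STR);

    // Hypothetical helper: returns group 1 (the text between the quotes)
    // for every match in the given line.
    static List<String> findLiterals(String line) {
        List<String> result = new ArrayList<>();
        Matcher m = PATTERN.matcher(line);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints the regex as the engine sees it, after Java string escaping.
        System.out.println(PATTERN_STR);

        String line = "myfunc(\"foo \\\"foo\\\" foo\", \"bar\");";
        for (String s : findLiterals(line)) {
            System.out.println(s);
        }
    }
}
```

Printing `PATTERN_STR` is a quick way to check that the Java literal really is the regex from the article, with one extra pair of parentheses added as a capturing group.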
A: 

Between double-quotes, you want to allow an escape sequence or any character other than a double-quote. You want to test them in that order to allow the longer alternative the opportunity to match.

Pattern pattern = Pattern.compile("\"((\\\\.|[^\"])+)\"");
Matcher matcher = pattern.matcher(line);

while (matcher.find()) {
  System.out.println(" FOUND "+matcher.groupCount()+" groups");
  System.out.println(matcher.group(1));
}

Output:

 FOUND 2 groups
foo \"foo\" foo
 FOUND 2 groups
bar
Greg Bacon
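To illustrate why the order of the alternatives matters here, a small sketch comparing the pattern from this answer with the same pattern with the two alternatives swapped. The class name `AlternationOrder` and the helper `groups` are invented for this example. Because `[^"]` also accepts a backslash, trying it first lets the backslash be consumed as an ordinary character, so the escaped quote that follows ends the match too early.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AlternationOrder {
    // Escape sequence tried first, as in the answer above.
    static final Pattern ESCAPE_FIRST = Pattern.compile("\"((\\\\.|[^\"])+)\"");

    // Same alternatives, swapped: a backslash is now matched by [^"] first.
    static final Pattern ESCAPE_LAST = Pattern.compile("\"(([^\"]|\\\\.)+)\"");

    // Hypothetical helper: collects group 1 of every match of p in line.
    static List<String> groups(Pattern p, String line) {
        List<String> result = new ArrayList<>();
        Matcher m = p.matcher(line);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String line = "myfunc(\"foo \\\"foo\\\" foo\", \"bar\");";
        System.out.println("escape first: " + groups(ESCAPE_FIRST, line));
        System.out.println("escape last:  " + groups(ESCAPE_LAST, line));
    }
}
```

With the swapped order, the first match stops right after the first backslash, and the rest of the literal is split into separate bogus matches. Excluding the backslash from the character class, as in `[^"\\]`, makes the two alternatives mutually exclusive, so the order stops mattering.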
A: 

What about strings inside comments:

/* foo "this is not a string" bar */

and what about when a single double quote is in a comment:

/* " */ printf("text");

you don't want to capture "*/ printf(" as a string.

In other words: if the above could occur in your C code, use a parser instead of regex.

Bart Kiers
+1, regexes have limits, and you have reached one of them.
Clement Herreman
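As a rough idea of what "use a parser" could look like at its simplest, here is a hand-written scanner sketch that walks the source once and keeps track of whether it is inside a comment, a character literal, or a string literal. The class `CStringScanner` and the method `stringLiterals` are hypothetical names. It is not a real C lexer: it assumes no line continuations inside `//` comments, no trigraphs, and no preprocessor tricks.

```java
import java.util.ArrayList;
import java.util.List;

class CStringScanner {
    // Returns the contents of each string literal in src, skipping anything
    // inside /* ... */ comments, // comments, and character literals.
    static List<String> stringLiterals(String src) {
        List<String> result = new ArrayList<>();
        int n = src.length();
        int i = 0;
        while (i < n) {
            char c = src.charAt(i);
            if (src.startsWith("/*", i)) {
                // Block comment: jump past the closing */ (or to the end if unterminated).
                int end = src.indexOf("*/", i + 2);
                i = (end < 0) ? n : end + 2;
            } else if (src.startsWith("//", i)) {
                // Line comment: jump past the end of the line.
                int end = src.indexOf('\n', i);
                i = (end < 0) ? n : end + 1;
            } else if (c == '"' || c == '\'') {
                // String or character literal: scan to the matching delimiter,
                // stepping over backslash escapes two characters at a time.
                int j = i + 1;
                while (j < n && src.charAt(j) != c) {
                    j += (src.charAt(j) == '\\') ? 2 : 1;
                }
                // Only terminated string literals are reported.
                if (c == '"' && j < n) {
                    result.add(src.substring(i + 1, j));
                }
                i = j + 1;
            } else {
                i++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(stringLiterals("/* foo \"this is not a string\" bar */"));
        System.out.println(stringLiterals("/* \" */ printf(\"text\");"));
    }
}
```

Character literals are handled too, because `'"'` would otherwise be mistaken for the start of a string. For anything beyond simple cases, a proper C parser or lexer is still the safer choice.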