I need to parse out `writeln("test");` from a string.
I was using `(?<type>writeln)\((?<args>[^\)]*)\);` as the regex, but it isn't perfect: if you try to parse `writeln("heloo :)");` or something similar, the match fails because of the ')' inside the quotes. Is there a way to make the regex ignore a ')' that sits inside quote marks and look for the next ')' instead?

Thanks,
Max

+1  A: 

You've encountered the sort of problem you get using regexes to parse non-regular languages.

That being said, try:

(?<type>writeln)\((?<args>("[^"]*"|))\);

It's not perfect, but nothing will be.
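As a rough check of the pattern above, here is a minimal sketch in Python. The language is an assumption (the question does not name one), and Python writes named groups as `(?P<name>...)`. The helper name `parse_call` is hypothetical.

```python
import re

# Same pattern as above, except Python spells named groups (?P<name>...).
# The args group accepts either one double-quoted string or nothing at all.
PATTERN = re.compile(r'(?P<type>writeln)\((?P<args>("[^"]*"|))\);')

def parse_call(text):
    """Return (type, args) for the first writeln call in text, or None."""
    m = PATTERN.search(text)
    return (m.group("type"), m.group("args")) if m else None

# A ')' inside the quotes no longer ends the argument early.
print(parse_call('writeln("heloo :)");'))

# An escaped quote inside the string is not handled by this pattern.
print(parse_call('writeln("hello \\"world\\"");'))
```

The first call yields the quoted argument including the `:)`, while the second returns `None`, because `"[^"]*"` stops at the first quote it sees, escaped or not.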

cletus
That gets around the example above, but not the case where you have an escaped quote: `writeln("hello \"world\"");`. So yeah, like you said, regex isn't a great solution for this. If you're doing lots of parsing, use a proper parser and grammar.
Drew Noakes
+2  A: 

Why not write a little parser for this? Just loop through the characters and have a simple state machine for parsing.
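A minimal sketch of such a state machine, in Python (an assumed language, since the question does not name one). The function name `parse_writeln` and the two flags are hypothetical, and the sketch assumes the input starts with `writeln(` and that backslash is the escape character inside strings.

```python
def parse_writeln(text):
    """Return the raw argument text of a writeln(...); call, or None."""
    prefix = "writeln("
    if not text.startswith(prefix):
        return None
    in_string = False  # currently inside a "..." literal
    escaped = False    # previous character was a backslash inside a string
    for i in range(len(prefix), len(text)):
        ch = text[i]
        if in_string:
            if escaped:
                escaped = False      # this character is escaped, skip it
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False    # unescaped quote closes the string
        elif ch == '"':
            in_string = True
        elif ch == ")":
            # The first ')' outside a string closes the call; require the ';'.
            if text.startswith(";", i + 1):
                return text[len(prefix):i]
            return None
    return None  # ran out of input before the call was closed

print(parse_writeln('writeln("heloo :)");'))
print(parse_writeln('writeln("hello \\"world\\"");'))
```

Both calls return the full quoted argument. The sketch does not track nested parentheses outside strings, so something like `writeln(f(x));` is rejected rather than parsed.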

This kind of problem is hard to solve with regular expressions because the grammar is not regular. Look up parsing HTML with regex on SO ;)

BUT: If you control your input to a certain extent, then you might just be able to get away with regexes. See other answers here for "good enough" ways to do it.

This basically boils down to:

  1. decide how deep the rabbit hole goes (how much "recursion" you want to simulate)
  2. create an alternative (branch) regex for each such recursion
  3. stab your eyes out the next time you need to change the regex

I do this all the time. And I hate myself for it!
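The steps above can be sketched for nested parentheses as follows. This is only an illustration in Python (an assumed language); the helpers `balanced` and `call_regex` are hypothetical names, and the sketch ignores quoted strings entirely, so a ')' inside quotes would still confuse it.

```python
import re

def balanced(depth):
    """Regex source for text whose parentheses nest at most `depth` levels."""
    pattern = r'[^()]*'  # depth 0: no parentheses at all
    for _ in range(depth):
        # One more level: plain characters, or a ( ... ) wrapping the previous level.
        pattern = r'(?:[^()]|\(' + pattern + r'\))*'
    return pattern

def call_regex(depth):
    """Compile a writeln matcher that tolerates `depth` levels of nesting."""
    return re.compile(r'(?P<type>writeln)\((?P<args>' + balanced(depth) + r')\);')

text = 'writeln(f(g(x)));'
print(call_regex(2).search(text).group("args"))  # two levels are enough here
print(call_regex(1).search(text))                # one level is not
```

Each extra level wraps the previous pattern in another branch, which is why the regex grows quickly and becomes painful to change.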

Daren Thomas
+1  A: 

The following will match patterns like writeln("hello :) \"world\"!");

string regex = "(?<type>writeln)\\(\"(?<args>(\\\\\"|[^\"])*)\"\\);";

I'm assuming this is only for single arguments.
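For a quick check, here is the same pattern without the string-literal escaping, sketched in Python (an assumed language; Python writes named groups as `(?P<name>...)`). The helper name `parse_escaped_call` is hypothetical.

```python
import re

# Inside the quotes, accept either a backslash-quote pair or any non-quote character.
PATTERN = re.compile(r'(?P<type>writeln)\("(?P<args>(\\"|[^"])*)"\);')

def parse_escaped_call(text):
    """Return (type, args) for the first single-argument writeln call, or None."""
    m = PATTERN.search(text)
    return (m.group("type"), m.group("args")) if m else None

# Escaped quotes and a ')' inside the string are both tolerated.
print(parse_escaped_call('writeln("hello :) \\"world\\"!");'))

# Two arguments do not match, in line with the single-argument assumption.
print(parse_escaped_call('writeln("a", "b");'))
```

Here `args` holds the text between the outer quotes, with the backslash escapes left in place.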

Peet Brits