views:

1562

answers:

6

Hello,

I try to keep it brief and concise. I have to write a program that takes queries in SQL form and searches an XML. Right now I am trying to disassemble a string into logical pieces so I can work with them. I have a string as input and want to get a MatchCollection as output.

Please not that the test string below is of a special format that I impose on the user to keep things simple. Only one statement per line is permitted and nested queries are excluded-

string testString = "select apples \n from dblp \r where we ate \n group by all of them \r HAVING NO SHAME \n";

I use Regex with the following pattern:

Regex reg = new Regex(@"(?<select> \A\bselect\b .)" +  
                                  @"(?<from> ^\bfrom\b .)" +
                                  @"(?<where> ^\bwhere\b .)" +
                                  @"(?<groupBy> ^\bgroup by\b .)" +
                                  @"(?<having> ^\bhaving\b .)"
                                  , RegexOptions.IgnoreCase | RegexOptions.Multiline
                                  );

As far as I know this should give me matches for every group with the test string. I would be looking for an exact match of "select" at the start of each line followed by any characters except newlines.

Now I create the collection:

MatchCollection matches = reg.Matches(testString);

To makes sure it worked I used a foreach and printed the matches like:

foreach(Match match in matches)
{
    Console.WriteLine("Select: {0}", match.Groups["select"]);
   //and so on
}

The problem is that the collection is always empty. There must be a flaw in the Regex somewhere but I am to inexperienced to find it. Could you please assist me? Thank you very much!

A: 

Thank you very much! Alas my problem persists!

I tried using .* instead of just . until I was told that . would even mathc multiple character. I have no doubt that this could be a problem but even when replacing it I get no result.

I fail to see why it is so difficult to match a line starting with a defined word and having any characters appended to it until the regex finds a newline. Seems to me that this should be a relatively easy task.

+3  A: 

I think you need to explicitly match the line terminators, as well as handle spaces better as others have suggested. Assuming the user can choose between \r and \n, try

@"(?<select>\Aselect .+)[\n\r]" +
@"(?<from>\s*from .+)[\n\r]" +
@"(?<where>\s*where .+)[\n\r]" +
@"(?<groupBy>\s*group by .+)[\n\r]" +
@"(?<having>\s*having .+)[\n\r]"

As long as you are using regular expressions, you probably want to do a bit better:

@"\Aselect (?<select>.+)[\n\r]" +
@"\s*from (?<from>.+)[\n\r]" +
@"\s*where (?<where>.+)[\n\r]" +
@"\s*group by (?<groupBy>.+)[\n\r]" +
@"\s*having (?<having>.+)[\n\r]"

My biggest problem with regular expressions for this sort of use is that the only error message you can give is that things failed. You can't give the user any further information about what they did wrong.

David Norman
A: 

There may be a problem with the newline matching: is it LF (Unix standard), CR (MacOS), or CR LF (Windows)? If you don't know, perhaps you should match it with: [\n\r]+

edit: You included some whitespace in your test string, surrounding the newlines, that you don't account for in your rexex.

(?<from>^\s*from\b.*[\n\r]+$)
Svante
A: 

As you said, it's easy enough to match the keyword(s) and then use (.+) to match the rest of the line. But you have to match all of the intervening characters, and you aren't doing that. (The ^ line anchor matches the position following the line separator, not the separator itself.) You can use \s+ to consume the line separator as well as any leading whitespace on the next line.

@"select\s+(?<select>.+)\s+" +
@"from\s+(?<from>.+)\s+" +
@"where\s+(?<where>.+)\s+" +
@"group by\s+(?<groupBy>.+)\s+" +
@"having\s+(?<having>.+)";

I also rearranged things so that the SQL keywords aren't captured; that seems redundant, since you're using named groups.

Alan Moore
A: 

I haven't tried to build a working regex for you, but I can see several issues. Others pointed out the first two issues, but not the third one.

  1. You can't use a single dot to match the variable parts such as "apples". Try \w+ or \S+
  2. Your string has embedded line breaks. You need to match those with [\r\n]+ or \s+
  3. The .NET regex engine treats \n as a line break, but NOT \r or \r\n. Thus, ^ will match after \n, but NOT after \r. If you do step 2, you don't need the anchors anyway, so remove them.
Jan Goyvaerts
A: 

Thank you all very much. You helped me a lot!