I've been working on this for a few hours now and can't find any help on it. Basically, I'm trying to split a SQL string into its parts (fields, from, where, having, groupBy, orderBy). I refuse to believe that I'm the first person to ever try this, so I'd like to ask the StackOverflow community for some advice. :)

To understand what I need, assume the following SQL string:

select * from table1 inner join table2 on table1.id = table2.id 
where field1 = 'sam' having table1.field3 > 0 
group by table1.field4 order by table1.field5 

I created a regular expression to group the parts accordingly:

select\s+(?<fields>.+)\s+from\s+(?<from>.+)\s+where\s+(?<where>.+)\s+having\s+(?<having>.+)\s+group\sby\s+(?<groupby>.+)\s+order\sby\s+(?<orderby>.+)

This gives me the following results:

fields => *
from => table1 inner join table2 on table1.id = table2.id
where => field1 = 'sam'
having => table1.field3 > 0
groupby => table1.field4
orderby => table1.field5 

The problem I'm facing is that if any clause after the 'from' clause is missing from the SQL string, the regular expression doesn't match at all.

To fix that, I've tried putting each optional part in its own (...)? group, but that doesn't work. It simply puts all the optional parts (where, having, groupBy, and orderBy) into the 'from' group.
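For illustration, here is a minimal reproduction of that behaviour, sketched in Python rather than .NET (Python writes named groups as `(?P<name>...)`). The exact pattern that was tried is not shown in the question, so `ATTEMPT` below is an assumed reconstruction: the original expression with each clause after 'from' wrapped in an optional group.

```python
import re

# Assumed reconstruction of the attempt: greedy .+ everywhere,
# every clause after 'from' wrapped in an optional group.
ATTEMPT = re.compile(
    r"select\s+(?P<fields>.+)\s+from\s+(?P<from>.+)"
    r"(\s+where\s+(?P<where>.+))?"
    r"(\s+having\s+(?P<having>.+))?"
    r"(\s+group\sby\s+(?P<groupby>.+))?"
    r"(\s+order\sby\s+(?P<orderby>.+))?",
    re.IGNORECASE,
)

sql = ("select * from table1 inner join table2 on table1.id = table2.id "
       "where field1 = 'sam' order by table1.field5")

m = ATTEMPT.match(sql)

# The greedy 'from' group runs to the end of the string, so every
# optional group matches nothing and stays unset.
print(m.group("from"))
print(m.group("where"), m.group("orderby"))
```

The greedy `.+` in the 'from' group consumes the rest of the string first, and because everything after it is optional, the engine never has a reason to give any of it back.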

Any ideas?

+2  A: 

It is not possible to do this perfectly using .NET regular expressions alone; you need a stack-based parser.

If you don't see why, consider the following two valid queries:

SELECT 'I''m from Kansas', 'where the grass is greener'
FROM Minnesota 
WHERE Grass = 'Blue'

SELECT 
    ID,
    Name IN (SELECT Name From Employees WHERE Rank > 4),
    Grade
FROM Employees
WHERE Rank < 4
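To make the first case concrete, here is a small sketch in Python (not .NET; named groups are written `(?P<name>...)`). The pattern `NAIVE` is a hypothetical keyword-splitting expression written for this illustration, and the query is the first one above with the apostrophe written as a doubled quote.

```python
import re

# Hypothetical keyword-based splitter: it knows nothing about quoting.
NAIVE = re.compile(
    r"^select\s+(?P<fields>.+?)\s+from\s+(?P<from>.+?)"
    r"(?:\s+where\s+(?P<where>.+))?$",
    re.IGNORECASE,
)

sql = ("SELECT 'I''m from Kansas', 'where the grass is greener' "
       "FROM Minnesota WHERE Grass = 'Blue'")

m = NAIVE.match(sql)

# The word 'from' inside the first string literal is taken as the
# FROM keyword, so the field list is cut off mid-literal.
print(m.group("fields"))  # 'I''m
print(m.group("from"))    # Kansas', 'where the grass is greener' FROM Minnesota
print(m.group("where"))   # Grass = 'Blue'
```

The regular expression has no notion of being inside a quoted string or a parenthesised subquery, which is the state a parser would track.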

EDIT

To answer the question:

// Lazy quantifiers (.+?) let each clause stop at the next keyword;
// ^ and $ force the pattern to span the entire string.
var regex = new Regex(@"
    ^
    select\s+(?<fields>.+?)
        \s+ from       \s+ (?<from>    .+?)
    (?: \s+ where      \s+ (?<where>   .+?))?
    (?: \s+ having     \s+ (?<having>  .+?))?
    (?: \s+ group\s+by \s+ (?<groupby> .+?))?
    (?: \s+ order\s+by \s+ (?<orderby> .+ ))?
    $",
    RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace);

(Re-tested)
This will not handle nested queries or string literals that contain clause keywords.
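As a quick check of how the optional groups behave, here is a sketch of the same pattern ported to Python (an assumption for illustration; the answer itself targets .NET). Python spells named groups `(?P<name>...)` and uses `re.VERBOSE` where .NET uses `RegexOptions.IgnorePatternWhitespace`. The names `CLAUSES`, `UNANCHORED`, and `split_clauses` are hypothetical, and the `strip()` call is an addition so that trailing whitespace does not end up in the last clause.

```python
import re

CLAUSES = re.compile(
    r"""
    ^
    select\s+(?P<fields>.+?)
        \s+ from       \s+ (?P<from>    .+?)
    (?: \s+ where      \s+ (?P<where>   .+?))?
    (?: \s+ having     \s+ (?P<having>  .+?))?
    (?: \s+ group\s+by \s+ (?P<groupby> .+?))?
    (?: \s+ order\s+by \s+ (?P<orderby> .+ ))?
    $
    """,
    re.IGNORECASE | re.VERBOSE,
)

def split_clauses(sql):
    """Return a dict of clause name -> text (None if absent), or None on no match."""
    m = CLAUSES.match(sql.strip())
    return m.groupdict() if m else None

full = ("select * from table1 inner join table2 on table1.id = table2.id "
        "where field1 = 'sam' having table1.field3 > 0 "
        "group by table1.field4 order by table1.field5 ")
partial = "select a, b from t order by a"

full_parts = split_clauses(full)
partial_parts = split_clauses(partial)
print(full_parts)
print(partial_parts)

# Without ^ and $, the lazy groups are free to stop as early as possible:
# 'from' captures a single character and the optional clauses are skipped.
UNANCHORED = re.compile(
    r"select\s+(?P<fields>.+?)\s+from\s+(?P<from>.+?)"
    r"(?:\s+where\s+(?P<where>.+?))?",
    re.IGNORECASE,
)
loose = UNANCHORED.search(full)
print(loose.group("from"), loose.group("where"))
```

The anchors matter because a lazy `.+?` only grows when something later in the pattern forces it to; `$` is what makes each clause extend up to the next keyword or the end of the string.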

SLaks
Thank you for your answer, and I definitely see your point. However, I don't need to support nested SQL queries for my requirement...
Luc
Your edit does not appear to be valid. I tested it against the query that I posted in my question and each part only returns a single character when using .+? -- Regards
Luc
@Luc: You're right; I forgot to anchor it. It works now.
SLaks
Genius! Thank you @SLaks! :)
Luc