views:

3881

answers:

10

I was writing some Unit tests last week for a piece of code that generated some SQL statements.

I was trying to figure out a regex to match SELECT,INSERT and UPDATE syntax so I could verify that my methods were generating valid SQL, and after 3-4 hours of searching and messing around with various regex editors I gave up.

I managed to get partial matches but because a section in quotes can contain any characters it quickly expands to match the whole statment.

Any help would be appreciated, I'm not very good with regular expressions but I'd like to learn more about them.

By the way it's C# regex that I'm after.

Clarification

I dont want to need access to a database as this is part of a Unit test and I don't wan't to have to maintain a database to test my code. which may live longer than the project.

A: 

Have you tried the lazy selectors. Rather than match as much as possible, they match as little as possible which is probably what you need for quotes.

Orion Adrian
A: 

I am assuming you did something like ".*" try instead "[^"]*" that will keep you from eating the whole line. It still will give false positives on cases where you have \" inside your strings.

J.J.
No I specified \s space and hyphen and it went to the end of the line
Omar Kooheji
A: 

Off the top of my head: Couldn't you pass the generated SQL to a database and use EXPLAIN on them and catch any exceptions which would indicate poorly formed SQL?

JeeBee
Thats not useful for a unit test where I may not necessarily have access to a database, which I know will always be running
Omar Kooheji
Depends on the server, for example, MySQL allows explain only on select statements.
Nouveau
A: 

To validate the queries, just run them with SET NOEXEC ON, that is how Entreprise Manager does it when you parse a query without executing it.

Besides if you are using regex to validate sql queries, you can be almost certain that you will miss some corner cases, or that the query is not valid from other reasons, even if it's syntactically correct.

Pop Catalin
A: 

I suggest creating a database with the same schema, possibly using an embedded sql engine, and passing the sql to that.

Marcin
+10  A: 

Regular expressions can match languages only a finite state automaton can parse, which is very limited, whereas SQL is a syntax. It can be demonstrated you can't validate SQL with a regex. So, you can stop trying.

Pablo Marambio
Agreed. Some other approach is needed here other than regexp. There's no way to match legal syntax for a select statement: http://download.oracle.com/docs/cd/B28359_01/server.111/b28286/statements_10002.htm#SQLRF01702
David Aldridge
A: 

I don't think that you even need to have the schema created to be able to validate the statement, because the system will not try to resolve object_name etc until it has successfully parsed the statement.

With Oracle as an example, you would certainly get an error if you did:

select * from non_existant_table;

In this case, "ORA-00942: table or view does not exist".

However if you execute:

select * frm non_existant_table;

Then you'll get a syntax error, "ORA-00923: FROM keyword not found where expected".

It ought to be possible to classify errors into syntax parsing errors that indicate incorrect syntax and errors relating to tables name and permissions etc..

Add to that the problem of different RDBMSs and even different versions allowing different syntaxes and I think you really have to go to the db engine for this task.

David Aldridge
+1  A: 

As far as I know this is beyond regex and your getting close to the dark arts of BnF and compilers.

http://savage.net.au/SQL/

Same things happens to people who want to do correct syntax highlighting. You start cramming things into regex and then you end up writing a compiler...

jms
+2  A: 

SQL is a type-2 grammar, it is too powerful to be described by regular expressions. It's the same as if you decided to generate C# code and then validate it without invoking a compiler. Database engine in general is too complex to be easily stubbed.

That said, you may try ANTLR's SQL grammars.

Constantin
A: 

There are ANTLR grammars to parse SQL. It's really a better idea to use an in memory database or a very lightweight database such as sqlite. It seems wasteful to me to test whether the SQL is valid from a parsing standpoint, and much more useful to check the table and column names and the specifics of your query.

MattMcKnight