tags:

views:

1291

answers:

9

Hi all,

I am not that hot at regular expressions and it has made my little mind melt so what.

I am trying to find all the tables names in a query. So say I have the query:

SELECT one, two, three FROM table1, table2 WHERE X=Y

I would like to pull out "table1, table2" or "table1" and "table2"

But what if there is no where statement. It could be the end of the file, or there could be a group by or an order by etc. I know "most" of the time this will not be an issue but I don't like the idea of coding for "most" situations and knowing I have left a hole that could cause things to go wrong later.

Is this a doable Regex expression? Am I being a Regex pleb?

(P.S. this will be done in C# but presume that doesn't matter much).

Thanks

Jon

+1  A: 

It's definitely not easy.

Consider subqueries.

select
  *
from
  A
  join (
    select
       top 5 *
    from
      B)
    on B.ID = A.ID
where
  A.ID in (
    select
      ID
    from
      C
    where C.DOB = A.DOB)

There are three tables used in this query.

Jason Lepack
+1  A: 

I think it would be easier to tokenize the string and look for SQL keywords that could bound the table names. You know the names will follow FROM, but they could be followed by WHERE, GROUP BY, HAVING, or no keyword at all if they're at the end of the query.

Bill the Lizard
+11  A: 

RegEx isn't very good at this, as it's a lot more complicated than it appears:

  • What if they use LEFT/RIGHT INNER/OUTER/CROSS/MERGE/NATURAL joins instead of the a,b syntax? The a,b syntax should be avoided anyway.
  • What about nested queries?
  • What if there is no table (selecting a constant)
  • What about line breaks and other whitespace formatting?
  • Alias names?

I could go on.

What you can do is look for an sql parser, and run your query through that.

Joel Coehoorn
I think the real deal killer is going to be views. There is going to be no practical way to parse the underlying table names of any views included in the query.
JohnFx
+1  A: 

Everything said about the usefulness of such a regex in the SQL context. If you insist on a regex and your SQL statements always look like the one you showed (that means no subqueries, joins, and so on), you could use

FROM\s+([^ ,]+)(?:\s*,\s*([^ ,]+))*\s+
Stefan Gehrig
A: 

I found this site that has a GREAT parser!

http://www.sqlparser.com/

well worth it. Works a treat.

Jon
+1  A: 

Hi, I'm pretty late to the party however I thought I would share a regex I am currently using to analyse all our database objects and I disagree with the sentiment that it is not possible to do this using one.

The regex has a few assumptions

1) You are not using the A,B join syntax style

2) Whatever regex parser you are using supports ignore case.

3) You're analyzing, selects, joins, updates, deletes and truncates. It doesn't support the aforementioned MERGE/NATURAL because we don't use them, however I'm sure further support wouldn't be difficult to add.

I am keen to know what type of transaction the table is part of so I have included Named Capture groups to tell me.

Now I've not used regex for a long time so there are probably improvements that can be made however so far in all my testing this is accurate.

\bjoin\s+(?<Retrieve>[a-zA-Z\._\d]+)\b|\bfrom\s+(?<Retrieve>[a-zA-Z\._\d]+)\b|\bupdate\s+(?<Update>[a-zA-Z\._\d]+)\b|\binsert\s+(?:\binto\b)?\s+(?<Insert>[a-zA-Z\._\d]+)\b|\btruncate\s+table\s+(?<Delete>[a-zA-Z\._\d]+)\b|\bdelete\s+(?:\bfrom\b)?\s+(?<Delete>[a-zA-Z\._\d]+)\b
MrEdmundo
A: 

Constructing a regular expression is going to be the least of your problems. Depending on the flavor of SQL you expect to support with this code, the number of ways you can reference a table in a SQL statement is staggering.

PLUS, if the query includes a reference to a view or UDF, the information about what underlying tables won't even be in the string at all making it completely impractical to get that information by parsing it. Also, you'd need to be smart about detecting temporary tables and excluding them from your results.

If you must do this, a better approach would be to make use of the APIs to the particular database engine that the SQL was intended for. For example you could create a view based on the query and then use the DB Server api to detect dependencies for that view. The DB engine is going to be able to parse it much more reliably than you ever will without an enormous effort to reverse engineer the query engine.

If, by chance, you are working with SQL Server, here is an article about detecting dependencies on that platform: Finding Dependencies in SQL Server 2005

JohnFx
A: 

There is a similar problem for me but I am not using nested queries etc..

What I use is

Select a,b,c from table1 d,table2 e where

Select a,b,c from table1,table2 where

Select a,b,c from table1 where

(I can also replace a,b,c with *)

and simple forms of

insert into,replace into, delete from, update.

None of these below exists in my query

" * What if they use LEFT/RIGHT INNER/OUTER/CROSS/MERGE/NATURAL joins instead of the a,b syntax? The a,b syntax should be avoided anyway. * What about nested queries? * What if there is no table (selecting a constant) * What about line breaks and other whitespace formatting? * Alias names? "

faruk hakan
you should really post this in its own question, and if wanted link to this this one. Please don't post multiple questions on a single thread.
Jon
A: 

This will pull out a table name on an insert Into query:

(?<=(INTO)\s)[^\s]*(?=\(())

The Following will do the same but with a select including joins

(?<=(from|join)\s)[^\s]*(?=\s(on|join|where))

Finally going back to an insert if you want to return just the values that are held in an insert query use the following Regex

(?i)(?<=VALUES[ ]*\().*(?=\))

I know this is an old thread but it may assist someone else looking around

Enjoy

Psymon25