tags:

views:

284

answers:

4

I have a series of select statements in a text file and I need to extract the field names from each select query. This would be easy if some of the fields didn't use nested functions like to_char() etc.

Given select statement fields that could have several nested parenthese like:

ltrim(rtrim(to_char(base_field_name, format))) renamed_field_name,

Or the simple case of just base_field_name as a field, what would the regex look like in Perl?

A: 

How about splitting each line into terms (replace every parenthesis, comma and space with a newline), then sorting:

perl -ne's/[(), ]/\n/g; print' < textfile | sort -u

You'll end up with a lot of content like:

fieldname1
fieldname1
formatstring
ltrim
rtrim
t_char
Ether
+11  A: 

Don't try to write a regex parser (though perl regexes can handle nested patterns like that), use SQL::Statement::Structure.

ysth
+1 for SQL::Statement and friends. While it can't handle every extreme of contorted SQL, it is at least the right kind of tool for this job.
pilcrow
+2  A: 

Why not ask the target database itself how it would interpret the queries?

In perl, one can use the DBI to query the prepared representation of a SQL query. Sometimes this is database-specific: some drivers (under the perl DBD:: namespace) support their RDBMS' idea of describing statements in ways analogous to the RDBMS' native C or C++ API.

It can be done generically, however, as the DBI will put the names of result columns in the statement handle attribute NAME. The following, for example, has a good chance of working on any DBI-supported RDBMS:

use strict;
use warnings;
use DBI;

use constant DSN => 'dbi:YouHaveNotToldUs:dbname=we_do_not_know';

my $dbh = DBI->connect(DSN, ..., { RaiseError => 1 });

my $sth;
while (<>) {
  next unless /^SELECT/i;   # SELECTs only, assume whole query on one line
  chomp;
  my $sql = /\bWHERE\b/i ? "$_ AND 1=0" : "$_ WHERE 1=0"; # XXX ugly!
  eval {
    $sth = $dbh->prepare($sql);  # some drivers don't know column names
    $sth->execute();             # until after a successful execute()
  };
  print $@, next if $@;  # oops, problem with that one
  print join(', ', @{$sth->{NAME}}), "\n";
}

The XXX ugly! bit there tries to append an always-false condition on the SELECT, so that the SQL engine doesn't have to do any real work when you execute(). It's a terribly naive approach -- that /\bWHERE\b/i test is no more correctly identifying a SQL WHERE clause than simple regexes correctly parse out SELECT field names -- but it is likely to work.

pilcrow
Why not? Because the world isn't ideal and neither is life. I only have access to the data output (sans headers) and the SQL query used to generate the output.
TheObserver
+1  A: 

In a somewhat related problem at the office I used:

my @SqlKeyWordList = qw/select from where .../; # (1)

my @Candidates =split(/\s/,$SqlSelectQuery);      # (2)

my %FieldHash;                                  # (3)
for my $Word (@Candidates)  { 
   next if grep($word,@SqlKeyWordList);
   $FieldHash($Word)++;
} 

Comments:

  1. SqlKeyWordList contains all the SQL keywords that are potentially in the SQL statement (we use MySQL, there are many SQL dialiects, choosing/building this list is work, look at my comments below!). If someone decided to use a keyword as a field name, you will need a regex after all (beter to refactor the code).
  2. Split the SQL statement into a list of words, this is the trickiest part and WILL REQUIRE tweeking. For now it uses Perl notion of "space" (=not in word) to split.
    Splitting the field list (select a,b,c) and the "from" portion of the SQL might be advisabel here, depends on your SQL statements.
  3. %MyFieldHash will contain one entry per select field (and gunk, until you validated your SqlKeyWorkList and the regex in (2)

Beware

  • there is nothing in this code that could not be done in Python.
  • your life would be much easier if you can influence the creation of said SQL statements. (e.g. make sure each field is written to a comment)
  • there are so many things that can/will go wrong in this parsing approach, you really should sidestep the issue entirely, by changing the process (saves time in the long run).
  • this is the regex we use at the office
   my @Candidates=split(/[\s
                  \(
                  \)
                  \+
                  \,
                  \*
                 \/
                  \-
                  \n
                  \
                  \=
                  \r
                 ]+/,$SqlSelectQuery
               );

lexu
clever, if fragile. `select 42 'let x=x'` breaks it two ways.
ysth
@ysth: (s)he who writes that, deserves what (s)he gets (and then some)
lexu