I am using Python 2.6.4.

I have a series of select statements in a text file and I need to extract the field names from each select query. This would be easy if some of the fields didn't use nested functions like to_char() etc.

Given select statement fields that could have several nested parentheses, like "ltrim(rtrim(to_char(base_field_name, format))) renamed_field_name," or the simple case of just "base_field_name" as a field, is it possible to use Python's re module to write a regex that extracts base_field_name? If so, what would the regex look like?

+12  A: 

Regular expressions are not suitable for parsing "nested" structures. Try, instead, a full-fledged parsing kit such as pyparsing -- examples of using pyparsing specifically to parse SQL can be found here and here, for example (you'll no doubt need to take the examples just as a starting point, and write some parsing code of your own, but, it's definitely not too difficult).
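
To make the nesting point concrete, here is a minimal sketch (not a real SQL parser) of the kind of bookkeeping a plain regex cannot do: it tracks parenthesis depth so that a select list is split only on top-level commas. The function name `split_top_level` and the sample field list are made up for illustration, and the sketch assumes there are no string literals containing commas or parentheses.

```python
def split_top_level(field_list):
    """Split a select list on commas that are not inside parentheses.

    Assumption: no quoted string literals containing ',' '(' or ')'.
    """
    fields = []
    current = []
    depth = 0  # how many '(' are currently open
    for ch in field_list:
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
        if ch == ',' and depth == 0:
            # A comma outside all parentheses ends the current field.
            fields.append(''.join(current).strip())
            current = []
        else:
            current.append(ch)
    tail = ''.join(current).strip()
    if tail:
        fields.append(tail)
    return fields


fields = split_top_level(
    "ltrim(rtrim(to_char(base_field_name, format))) renamed_field_name, other_field")
```

The comma inside `to_char(...)` is left alone because the depth counter is non-zero there; each resulting field can then be handed to whatever extraction step you choose.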

Alex Martelli
+1 for reminding the world that well-parenthesized expressions (well, all Chomsky Type-2 languages) need more than a regexp to be properly parsed :)
Agos
+1  A: 
>>> import re
>>> string = 'ltrim(rtrim(to_char(base_field_name, format))) renamed_field_name'
>>> rx = re.compile(r'^(.*?\()*(.+?)(,.*?)*(,|\).*?)*$')
>>> rx.search(string).group(2)
'base_field_name'
>>> rx.search('base_field_name').group(2)
'base_field_name'
Attila Oláh
PS: as Alex Martelli said, you should use a real parser here. Anyway, if you only want a quick regex that just works, you can use this. But you should really use a parser, as this regex looks rather ugly :)
Attila Oláh
I'm not after something that looks pretty, since it's a one-off to get me the data I want so I can do other things with it. :) But thanks; my regex is rusty and I figured someone might know better.
TheObserver
+2  A: 

Either a table-driven parser as Alex Martelli suggests or a hand-written recursive descent parser. They're not hard and quite rewarding to write.
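
For what it's worth, a hand-written recursive descent parser for this case can be quite small. The sketch below assumes a deliberately tiny grammar (names, quoted strings, integers, parentheses and commas only) and assumes that the base field is always the first argument of the innermost call, as in the question's example. The names `tokenize`, `Parser` and `extract_base_field_name` are invented for this illustration.

```python
import re

# One token: a name (optionally dotted), a quoted string, an integer,
# or one of the punctuation characters ( ) ,
TOKEN = re.compile(r"\s*([A-Za-z_][\w.]*|'[^']*'|\d+|[(),])")


def tokenize(text):
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError("unexpected character at position %d" % pos)
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


class Parser(object):
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self, expected=None):
        tok = self.peek()
        if tok is None or (expected is not None and tok != expected):
            raise ValueError("expected %r, got %r" % (expected, tok))
        self.pos += 1
        return tok

    def expr(self):
        """expr := ATOM | NAME '(' [expr (',' expr)*] ')'

        Returns the atom itself, or the value of a call's first argument.
        """
        name = self.take()
        if name in ('(', ')', ','):
            raise ValueError("unexpected %r" % name)
        if self.peek() != '(':
            return name  # a bare atom, e.g. a field name
        self.take('(')
        args = []
        if self.peek() != ')':
            args.append(self.expr())
            while self.peek() == ',':
                self.take(',')
                args.append(self.expr())
        self.take(')')
        # Assumption: the field of interest is the first argument.
        return args[0] if args else name


def extract_base_field_name(field):
    # Anything after the first expression (e.g. an alias) is ignored.
    return Parser(tokenize(field)).expr()


result = extract_base_field_name(
    'ltrim(rtrim(to_char(base_field_name, format))) renamed_field_name')
```

Because `expr` calls itself for each argument, any depth of nesting is handled, and the plain `base_field_name` case falls out of the same code path.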

just somebody
+1  A: 

This may be good enough:

import re
print re.match(r".*\(([^\)]+)\)", "ltrim(to_char(field_name, format))").group(1)

You would need to do further processing. For example, pick up the function name as well and pull out the field name according to that function's signature.

.*\b(\w+)\(([^\)]+)\)
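
As a rough sketch of that "further processing", one option is a lookup table saying which argument position holds the field for each function. The table `FIELD_ARG_INDEX`, its entries (including the made-up `some_func`), and the helper `field_from_innermost_call` are all hypothetical; the pattern here is a slightly different one that only looks at the innermost call.

```python
import re

# Hypothetical table: position of the field argument for each function.
# Functions not listed are assumed to take the field as argument 0.
FIELD_ARG_INDEX = {
    'to_char': 0,
    'some_func': 1,  # made-up function whose field is the second argument
}

# Innermost call: a name followed by parentheses that contain no parentheses.
INNERMOST_CALL = re.compile(r"(\w+)\(([^()]+)\)")


def field_from_innermost_call(expression):
    m = INNERMOST_CALL.search(expression)
    if m is None:
        # No function call at all: treat the text as a bare field name.
        return expression.strip()
    func = m.group(1).lower()
    args = [a.strip() for a in m.group(2).split(',')]
    return args[FIELD_ARG_INDEX.get(func, 0)]


name = field_from_innermost_call("ltrim(to_char(field_name, format))")
```

This only inspects the innermost call, so it still cannot cope with a field that sits next to another nested call; for that you are back to a parser.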
Maxwell Troy Milton King
This prints 'field_name, format', not 'field_name' for me, and it also doesn't work for the simple string 'field_name'.
Attila Oláh
How do you know every function is going to accept the same arguments?
Maxwell Troy Milton King
@Maxwell Troy Milton King: You don't.
Attila Oláh
A: 

Do you really need regular expressions? To get the one you've got up there I'd use

  s[s.rfind('(')+1:s.find(')')].split(',')[0]

with 's' containing the original string.

Of course, it's not a general solution, but...
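
One thing to watch with the slice: when the string has no parentheses at all, `find(')')` returns -1 and the slice drops the last character. A small sketch that guards against that is below; the function name `base_field` is made up, and it still assumes the field is the first argument of the innermost call.

```python
def base_field(s):
    start = s.rfind('(') + 1   # rfind gives -1 with no '(', so start is 0
    end = s.find(')')
    if end == -1:
        end = len(s)           # no ')': keep the whole string
    return s[start:end].split(',')[0].strip()


nested = base_field(
    'ltrim(rtrim(to_char(base_field_name, format))) renamed_field_name')
simple = base_field('base_field_name')
```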

Heim
A compiled regex should be much faster than this. Well, I guess we're not in a hurry, but still, just for the sake of efficiency.
Attila Oláh
You may find that working directly with strings is faster. It depends heavily on the regex and on the complexity of the equivalent code you need to write to do the same thing without a regex. Actually, have you tried comparing the two?
Heim
Oh, and in case I wanted to go for the equivalent regexp, I'd use "\(([^(),]+),", which is slightly faster than the purely string-based one. Both of them are one order of magnitude faster than your regexp...
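
If anyone wants to repeat the comparison, something along these lines should do; the helper names and the iteration count are arbitrary choices for this sketch, and the timings will of course vary by machine and Python version, so none are quoted here.

```python
import re
import timeit

s = 'ltrim(rtrim(to_char(base_field_name, format))) renamed_field_name'
simple_rx = re.compile(r"\(([^(),]+),")


def with_slicing():
    return s[s.rfind('(') + 1:s.find(')')].split(',')[0]


def with_regex():
    return simple_rx.search(s).group(1)


# Sanity check: both approaches agree on the example string.
assert with_slicing() == with_regex() == 'base_field_name'

for func in (with_slicing, with_regex):
    seconds = timeit.timeit(func, number=100000)
    print("%s: %.3f s" % (func.__name__, seconds))
```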
Heim
You're right. My regex is rusty.
Attila Oláh
+1  A: 

Here's a really hacky parser that does what you want.

It works by calling 'eval' on the text to be parsed, mapping all identifiers to a function which returns its first argument (which I'm guessing is what you want given your example).

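class FakeFunction(object):
    """Stands in for any identifier: a field name or a SQL function."""
    def __init__(self, name):
        self.name = name
    def __call__(self, *args):
        # Used as a function: pass the first argument straight through.
        return args[0]
    def __str__(self):
        # Printed as a plain name: show the identifier itself.
        return self.name

class FakeGlobals(dict):
    def __getitem__(self, x):
        # Every name that eval looks up resolves to a FakeFunction.
        return FakeFunction(x)

def ExtractBaseFieldName(x):
    return eval(x, FakeGlobals())

print ExtractBaseFieldName('ltrim(rtrim(to_char(base_field_name, format)))')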
Paul Hankin