Hi, I am trying to create a Python function that takes a plain English description of a regular expression and returns the regular expression to the caller.

Currently I am thinking of writing the description in YAML. The description would be stored as a raw string variable and passed to a translation function, whose output is then passed to the 're' module. Here is a rather simplistic example:

# a(b|c)d+e*
re1 = """
- literal: 'a'
- one_of: 'b,c'
- one_or_more_of: 'd'
- zero_or_more_of: 'e'
"""
myre = re.compile(getRegex(re1))
myre.search(...)

etc.
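
For illustration, here is a minimal sketch of what such a translation function might look like. It is only an assumption about the behaviour: it handles just the four keys from the example above, and it reads the "- key: 'value'" lines with a hand-rolled parser rather than a real YAML library. The name get_regex and the choice to escape every value are hypothetical.

```python
import re

def get_regex(description):
    """Translate "- key: 'value'" lines into a regex string (sketch only)."""
    parts = []
    for line in description.strip().splitlines():
        # Split "- literal: 'a'" into key "literal" and value "a".
        key, _, value = line.strip().lstrip("- ").partition(":")
        key = key.strip()
        value = value.strip().strip("'\"")
        if key == "literal":
            parts.append(re.escape(value))
        elif key == "one_of":
            # Comma-separated alternatives become a non-capturing group.
            alternatives = "|".join(re.escape(v.strip()) for v in value.split(","))
            parts.append("(?:" + alternatives + ")")
        elif key == "one_or_more_of":
            parts.append("(?:" + re.escape(value) + ")+")
        elif key == "zero_or_more_of":
            parts.append("(?:" + re.escape(value) + ")*")
        else:
            raise ValueError("unknown rule: %r" % key)
    return "".join(parts)

re1 = """
- literal: 'a'
- one_of: 'b,c'
- one_or_more_of: 'd'
- zero_or_more_of: 'e'
"""
myre = re.compile(get_regex(re1))
print(myre.pattern)
```

Even this toy version has to decide how to handle escaping and grouping, which hints at how much the syntax would grow once capturing groups and anchors are added.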

Does anyone think something of this sort would be of wider use? Do you know of any existing packages that already do this? What limitations do you see in this approach? Does anyone think that having the declarative string in the code would make it more maintainable?

+5  A: 

This is actually pretty similar (identical?) to how a lexer/parser works. If you had a defined grammar then you could probably write a parser with not too much trouble. For instance, you could write something like this:

<expression> ::= <rule> | <rule> <expression> | <rule> " followed by " <expression>
<rule>       ::= <val> | <qty> <val>
<qty>        ::= "literal" | "one" | "one of" | "one or more of" | "zero or more of"
<val>        ::= "a" | "b" | "c" | "d" | ... | "Z"

That's nowhere near a perfect description. For more info, take a look at this BNF of the regex language. You could then look at lexing and parsing the expression.

If you did it this way you could probably get a little closer to Natural Language/English versions of regexes.
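
As a rough sketch of how the grammar above could be turned into code, the following translates phrases such as "one or more of d" into regex fragments. It is a simplification, not a real lexer/parser: it only splits on " followed by ", it treats each value as a literal string, and the names QUANTIFIERS, parse_rule and parse_expression are made up for this example.

```python
import re

# Hypothetical mapping from the <qty> phrases to regex quantifier suffixes.
QUANTIFIERS = {
    "literal": "",
    "one of": "",
    "one or more of": "+",
    "zero or more of": "*",
}

def parse_rule(text):
    """Translate one <rule> (an optional quantity phrase plus a value)."""
    text = text.strip()
    # Try longer phrases first so a longer quantifier is never shadowed.
    for phrase in sorted(QUANTIFIERS, key=len, reverse=True):
        if text.startswith(phrase + " "):
            value = text[len(phrase):].strip()
            return "(?:" + re.escape(value) + ")" + QUANTIFIERS[phrase]
    # No quantity phrase: treat the whole text as a literal value.
    return re.escape(text)

def parse_expression(text):
    """Translate rules joined by ' followed by ' into one regex string."""
    return "".join(parse_rule(rule) for rule in text.split(" followed by "))

source = "literal a followed by one or more of d followed by zero or more of e"
print(parse_expression(source))
```

A proper implementation would tokenize the input and build a parse tree, which is where a parsing library earns its keep.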


I can see a tool like this being useful, but as was previously said, mainly for beginners. The main limitation of this approach is the amount of code you have to write to translate the language into regex (and/or vice versa). On the other hand, I think a two-way translation tool would be closer to ideal and see more use. Being able to take a regex and turn it into English might be a lot more helpful to spot errors.

Of course it doesn't take too long to pick up regex, as the syntax is usually terse and most of the meanings are pretty self-explanatory, at least if you use | or || as OR in your language, and you think of * as multiplying by 0-N and + as adding 0-N more to the one you already have.

Though sometimes I wouldn't mind typing "find one or more 'a' followed by three digits or 'b' then 'c'"

Wayne Werner
In reply to your `Being able to take a regex and turn it into English might be a lot more helpful to spot errors.`, try the `re.DEBUG` flag with Python in REPL mode.
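
A small sketch of what that looks like in practice. `re.DEBUG` prints the parsed structure to standard output while compiling; the helper name `explain` is hypothetical and simply captures that output as a string. The exact text differs between Python versions, so no output is shown here.

```python
import contextlib
import io
import re

def explain(pattern):
    """Return the debug dump that re.DEBUG prints while compiling `pattern`."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        re.compile(pattern, re.DEBUG)
    return buffer.getvalue()

print(explain(r"a(b|c)d+e*"))
```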
Daenyth
@Daenyth - I'm aware of that mode, though I've not had cause to use it, and I can't say it's much better than the original regex, unless it's an extremely complicated regex.
Wayne Werner
+4  A: 

Please take a look at pyparsing. Many of the issues that you describe with RE's are the same ones that inspired me to write that package.

Here are some specific features of pyparsing from the O'Reilly e-book chapter "What's so special about pyparsing?".

Paul McGuire
You beat me by a second! BTW, thanks for writing pyparsing :)
Roberto Bonvallet
+2  A: 

Maybe not exactly what you are asking for, but there is a way to write regexes more readably (the VERBOSE flag, X for short):

import re

rex_name = re.compile(r"""
    [A-Za-z]    # first letter
    [a-z]+      # the rest
""", re.X)

rex_name.match('Joe')
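
For comparison, here is a sketch of the pattern from the question, a(b|c)d+e*, written the same way. The variable name verbose_pattern is only for illustration. In verbose mode, whitespace in the pattern is ignored and everything after an unescaped # is a comment.

```python
import re

verbose_pattern = re.compile(r"""
    a           # the literal 'a'
    (b|c)       # one of 'b' or 'c'
    d+          # one or more 'd'
    e*          # zero or more 'e'
""", re.VERBOSE)

match = verbose_pattern.fullmatch("abdde")
print(match.group(1) if match else None)
```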
mykhal
+1  A: 

For developers trying to write regular expressions that are easy to grok and maintain, I wonder whether this sort of approach would offer anything that re.VERBOSE does not provide already.

For beginners, your idea might have some appeal. However, before you go down this path, you might try to mock up what your declarative syntax would look like for more complicated regular expressions using capturing groups, anchors, look-ahead assertions, and so forth. One challenge is that you might end up with a declarative syntax that is just as difficult to remember as the regex language itself.

You might also think about alternative ways to express things. The first thought that occurred to me was to express a regex using functions with short, easy-to-remember names. For example:

from refunc import *

pattern = Compile(
    'a',
    Capture(
        Choices('b', 'c'),
        N_of( 'd', 1, Infin() ),
        N_of( 'e', 0, Infin() ),
    ),
    Look_ahead('foo'),
)

But when I see that in action, it looks like a pain to me. There are many aspects of regex that are quite intuitive -- for example, + to mean "one or more". One option would be a hybrid approach, allowing your user to mix those parts of regex that are already simple with functions for the more esoteric bits.

pattern = Compile(
    'a',
    Capture(
        '[bc]',
        'd+',
        'e*',
    ),
    Look_ahead('foo'),
)
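
To make the idea concrete, here is a minimal sketch of how such functions might be implemented. The refunc module does not exist; the function names are taken from the examples above, but their behaviour is an assumption. Each function just returns a regex fragment as a string, and plain strings are passed through unescaped, which is what allows the hybrid style.

```python
import re

def Infin():
    """Hypothetical marker for 'no upper bound'."""
    return None

def Choices(*options):
    """Match any one of the given fragments."""
    return "(?:" + "|".join(options) + ")"

def N_of(fragment, low, high=None):
    """Repeat a fragment between `low` and `high` times (None = unbounded)."""
    upper = "" if high is None else str(high)
    return "(?:%s){%d,%s}" % (fragment, low, upper)

def Capture(*fragments):
    """Wrap the concatenated fragments in a capturing group."""
    return "(" + "".join(fragments) + ")"

def Look_ahead(fragment):
    """Positive look-ahead assertion."""
    return "(?=" + fragment + ")"

def Compile(*fragments):
    """Concatenate the fragments and compile the result."""
    return re.compile("".join(fragments))

pattern = Compile(
    'a',
    Capture(
        Choices('b', 'c'),
        N_of('d', 1, Infin()),
        N_of('e', 0, Infin()),
    ),
    Look_ahead('foo'),
)
print(pattern.pattern)
```

Because the fragments are ordinary strings, Capture('[bc]', 'd+', 'e*') works with the same functions.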

I would add that in my experience, regular expressions are about learning a thought process. Getting comfortable with the syntax is the easy part.

FM