+1  Q: 

Regex in Python

I have the following string:

schema(field1, field2, field3, field4 ... fieldn)

I need to transform the string into an object whose name attribute is schema and whose field names are stored in another attribute as a list.

How do I do this in Python with a regular expression?

+5  A: 

Are you looking for something like this?

>>> import re
>>> s = 'schema(field1, field2, field3, field4, field5)'
>>> name, _, fields = s[:-1].partition('(')
>>> fields = fields.split(', ')
>>> if not all(re.match(r'[a-z]+\d+$', i) for i in fields):
...     print('bad input')
...
>>> sch = type(name, (object,), {'attr': fields})
>>> sch
<class '__main__.schema'>
>>> sch.attr
['field1', 'field2', 'field3', 'field4', 'field5']
SilentGhost
Thanks but I am looking for a solution that, in the process, also allows me to validate that the string is in the format specified above.
Yasmin Hanifa
Just wondering, do you have any specific reason for using `partition()` instead of `split(...,1)` or is it simply preference? Either way, +1 :)
WoLpH
@Yasmin: which is?
SilentGhost
name(a1, a2, a3, a4 up to an)
Yasmin Hanifa
@Yasmin: see my edit
SilentGhost
@WoLpH: `partition` is faster.
SilentGhost
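Speed aside, the two calls also behave differently when the separator is missing. A minimal sketch of that difference (the sample strings are made up for illustration and no timing claim is implied):

```python
s = 'schema(field1, field2)'

# With the separator present, both calls yield the same pieces.
name, sep, rest = s.partition('(')
pieces = s.split('(', 1)

# Without the separator, partition() still returns a 3-tuple,
# so tuple unpacking never fails; split() returns a 1-element list.
no_paren = 'schema'
partitioned = no_paren.partition('(')
split_once = no_paren.split('(', 1)
```

Because `partition()` always returns three items, `name, _, fields = ...` is safe to unpack even on malformed input, whereas unpacking the result of `split('(', 1)` into two names would raise `ValueError` there.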
+1 for using type() to create a class on the fly, never seen it used quite like that before :)
Aphex
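For anyone who has not seen the three-argument form before, here is a small sketch of what `type(name, bases, namespace)` does. The names `Schema` and `SchemaStatic` are hypothetical and only serve the comparison:

```python
# type(name, bases, namespace) builds a class object at runtime.
Schema = type('schema', (object,), {'attr': ['field1', 'field2']})

# Roughly the same class written with an ordinary class statement.
class SchemaStatic(object):
    attr = ['field1', 'field2']

inst = Schema()
class_name = Schema.__name__          # taken from the first argument
shared = inst.attr is Schema.attr     # attr lives on the class, not the instance
```

Note that `attr` is a class attribute here, so every instance sees the same list object.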
Your idea is good, but you should maybe compile the regexes before using them to speed up the matching.
Elenaher
@Elenaher: they're cached internally.
SilentGhost
@SilentGhost: OK, I didn't know. Thanks for the hint.
Elenaher
+1  A: 

You could use something like this (in two rounds, because a repeated capture group in Python's re keeps only its last match; thanks to SilentGhost for pointing it out):

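import re

s = 'schema(field1, field2, field3, field4)'

# Round 1: separate the name from the raw argument list.
# Raw strings keep the backslashes intact; digits and spaces are allowed in the list.
pattern = re.compile(r"^([a-z]+)\(([a-z0-9, ]*)\)$")

ret = pattern.match(s)

if ret is None:
    ...
else:
    name, args = ret.groups()

    # Round 2: validate the argument list as a whole.
    arg_pattern = re.compile(r"^[a-z0-9]+(?:, ?[a-z0-9]+)*$")

    ret2 = arg_pattern.match(args)

    # same checking as above
    if ret2 is None:
         ...
    else:
         # A repeated group would keep only its last match, so split instead.
         args_f = re.split(r", ?", args)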
Elenaher
it only works with two arguments: a repeated capture group in Python's re keeps only its last match
SilentGhost
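The behaviour behind this comment can be seen in isolation: a quantified group such as `(,[a-z]+)*` may match several times, but `groups()` reports only its last repetition. A short sketch, with an input string made up for illustration:

```python
import re

m = re.match(r'^([a-z]+)(,[a-z]+)*$', 'a,b,c,d')

# Group 2 matched ',b', ',c' and ',d' in turn; only the last one is kept.
groups = m.groups()

# To collect every item, scan the string instead of relying on the group.
all_items = re.findall(r'[a-z]+', 'a,b,c,d')
```

This is why the output contains only the first and last field, however many fields the input has.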
Does it work for more than two fields? I tried with four fields, and `print fields` prints only schema, the first field and the last. Is that an error?
Yasmin Hanifa
Yes indeed (cf. SilentGhost). I am trying to solve that ...
Elenaher
@Elenaher: you could fix it by splitting input string and checking each element independently (see my answer).
SilentGhost
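One way to read that suggestion, as a rough sketch: match the outer `name(...)` shape once, split the inside on the separator, and validate each piece on its own. The function name `parse_schema` and the field pattern `FIELD_RE` are assumptions for illustration, not taken from either answer:

```python
import re

# Hypothetical rule for a single field: lowercase letters, then optional digits.
FIELD_RE = re.compile(r'[a-z]+\d*')

def parse_schema(s):
    """Return (name, fields) or raise ValueError if s is not name(f1, f2, ...)."""
    outer = re.fullmatch(r'([a-z]+)\((.*)\)', s)
    if outer is None:
        raise ValueError('bad input: %r' % s)
    name, raw = outer.groups()
    fields = raw.split(', ')
    # Check every element independently instead of using a repeated group.
    if not all(FIELD_RE.fullmatch(f) for f in fields):
        raise ValueError('bad field list: %r' % raw)
    return name, fields

result = parse_schema('schema(field1, field2, field3, field4)')
```

Splitting sidesteps the repeated-group limitation entirely, since no capture group ever has to hold more than one value.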
+1  A: 

Regular expressions for things like that probably need tests:

import unittest

import re

# Verbose regular expression!  http://docs.python.org/library/re.html#re.X
p = r"""

(?P<name>[^(]+)         # Match the pre-open-paren name.
\(                      # Open paren
(?P<fields>             # Comma-separated fields
    (?:
        [a-zA-Z0-9_-]+
        (?:,\ )         # Subsequent fields must be separated by a comma and a space
    )*
    [a-zA-Z0-9_-]+       # At least one field. No trailing comma or space allowed.
)

\)                      # Close-paren
"""

# Compiled for speed!
cp = re.compile(p, re.VERBOSE)

class Foo(object):
    pass


def validateAndBuild(s):
    """Validate a string and return a built object.
    """
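    # fullmatch() (Python 3.4+) rejects extra text before or after the schema;
    # search() would accept a valid schema buried inside a longer string.
    match = cp.fullmatch(s)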
    if match is None:
        raise ValueError('Bad schema: %s' % s)

    schema = match.groupdict()
    foo = Foo()
    foo.name = schema['name']
    foo.fields = schema['fields'].split(', ')

    return foo



class ValidationTest(unittest.TestCase):
    def testValidString(self):
        s = "schema(field1, field2, field3, field4, fieldn)"

        obj = validateAndBuild(s)

        self.assertEqual(obj.name, 'schema')

        self.assertEqual(obj.fields, ['field1', 'field2', 'field3', 'field4', 'fieldn'])

    invalid = [
        'schema field1 field2',
        'schema(field1',
        'schema(field1 field2)',
        ]

    def testInvalidString(self):
        for s in self.invalid:
            self.assertRaises(ValueError, validateAndBuild, s)


if __name__ == '__main__':
    unittest.main()
David Eyk
how's that any different from my answer? except having all redundant testing code and an ugly regex?
SilentGhost
@David, how do I change the regex to make the space between the fields optional?
Yasmin Hanifa
On line 13, change `\ )` to `\ ?)`. This makes the escaped space optional. (See the section "Quantifiers" at <http://www.regular-expressions.info/reference.html>.) The `split(', ')` call on the fields then also has to tolerate a missing space.
David Eyk
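A sketch of that change applied to the pattern from the answer, with the split adjusted so fields come out clean whether or not a space follows the comma. The helper name `split_fields` and the use of `fullmatch()` are assumptions added here, not part of the original answer:

```python
import re

# The answer's verbose pattern with the space after each comma made optional.
p = r"""
(?P<name>[^(]+)         # Match the pre-open-paren name.
\(                      # Open paren
(?P<fields>
    (?:
        [a-zA-Z0-9_-]+
        (?:,\ ?)        # Comma, then an optional space
    )*
    [a-zA-Z0-9_-]+      # At least one field. No trailing comma allowed.
)
\)                      # Close-paren
"""
cp = re.compile(p, re.VERBOSE)

def split_fields(s):
    match = cp.fullmatch(s)
    if match is None:
        raise ValueError('Bad schema: %s' % s)
    # Split on a comma with an optional space, mirroring the pattern.
    return re.split(r', ?', match.group('fields'))

mixed = split_fields('schema(field1,field2, field3)')
```

Using `re.split(r', ?', ...)` keeps the splitting rule in step with the validation rule; a plain `split(', ')` would leave `field1,field2` as a single element.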
Because regexes are supposed to look like Perl (incomprehensible)
Nick T
I personally like to be able to comprehend my regex months later.
David Eyk
@David: my regex is 10 chars, I won't have problems comprehending it years later.
SilentGhost
@SilentGhost Well you "cheated" and used str.partition() instead of a regex. ;) Seriously, your solution is probably easier to read. Mine is more flexible (as validation goes), with a more complex regex (complex enough to require a test, imo). We both demonstrated some neat language features. Anyone who reads this is likely to learn something. I know I did.
David Eyk