views:

246

answers:

3

In PostgreSQL, there is a BLOB datatype called bytea. It's just an array of bytes.

bytea literals are output in the following way:

'\\037\\213\\010\\010\\005`Us\\000\\0001.fp3\'\\223\\222%'

See PostgreSQL docs for full definition of the format.

I'm trying to construct a Perl regular expression which will match any such string.
It should also match standard ANSI SQL string literals, like 'Joe', 'Joe''s Mom', 'Fish Called ''Wendy'''
It should also match backslash-escaped variant: 'Joe\'s Mom', .

First aproach (shown below) works only for some bytea representations.

s{ '               # Opening apostrophe
    (?:            # Start group
        [^\\\']    #   Anything but a backslash or an apostrophe
    |              #  or
        \\ .       #   Backslash and anything
    |              #  or
        \'\'       #   Double apostrophe
    )*             # End of group
  '                # Closing apostrophe
}{LITERAL_REPLACED}xgo;

For other (longer ones, with many escaped apostrophes, Perl gives such warning:

Complex regular subexpression recursion limit (32766) exceeded at ./sqa.pl line 33, <> line 1.

So I am looking for a better (but still regex-based) solution, it probably requires some regex alchemy (avoiding backreferences and all).

A: 

First of all, it seems like you're trying to so two very different things in one regexp:

  • Matching it for correctness.
  • Unquoting it.

To match it, you could try something like his:

m{  ^              # Start of string
    '              # Opening apostrophe
    (?>            # Start non-backtracking group
        [^\\\']    #   Anything but a backslash or an apostrophe
    |              #  or
        (?:        #   Start group
          \d{3}    #     3 digits
          |
          .        #     one other character
        )          #   end group
    |              #  or
        ''         #   Double apostrophe
    )*             # End of group
   '               # Closing apostrophe
   $               # End of string
}xms;
Leon Timmermans
no I'm not testing for correctness. thanks for this solution I will test it.
filiprem
@Leon: You're missing the `\'` in this case.
KennyTM
@KennyTM: Oops. Accidentally posted an older version. Fixed it.
Leon Timmermans
@filiprem: then please explain what you are trying to do. Unquoting?
Leon Timmermans
@Leon: I'm trying to construct a Perl regular expression which will match any string literal in PostgreSQL.
filiprem
A: 

OK, here the best solution I could put together, thanks to Leon and hobbs.

Note: This is not the solution I was looking for! It still makes Perl fail with warning "recursion limit (32766) exceeded", for some long strings. (try to stuff 400k random bytes into a bytea field, then export with pg_dump --inserts).

However, it matches most bytea strings (as they appear in SQL code and in server logs), and ANSI SQL string literals. For example:

'\014cS\0059\036a4JEd\021o\005t\0015K7'
'\\037\\213\\010\\010\\005`Us\\000\\0001.fp3\'\\223\\222%'
' Joe''s Mom friend\'s dog is called \'Fluffy'''

And here's the regex:

m{ 
    '                   # opening apostrophe
    (?>                 # start non-backtracking group
        [^\\']+         # anything but a backslash or an apostrophe, one or more times
    |                   # or
        (?:                 # group of
            \\ \\? [0-7]{3} # one or two backslashes and three octal digits
        )+                  # one or more times
    |                   # or
        ''              # double apostrophe
    |                   # or
        \\ [\\']        # backslash-escaped apostrophe or backslash
    )*                  # end of group
    '                   # closing apostrophe
}x;
filiprem
A: 

If you don't care about correctness, at least for now, couldn't you just try to match against regular quoted string literals? Probably something like

m{
    (?>                     # start of a quote group
        '                   # opening apostrophe
        (?>                 # start non-backtracking group
            [^\\']+         # anything but a backslash or an apostrophe, one or more times
        |                   # or
            \\ .            # backslash-escaped something
        )*                  # end of group
        '                   # closing apostrophe
    )+                      # end of a quote group, many of these
}x;
Juan Antonio