tags:

views:

92

answers:

1

I'm using a bytea type in PostgreSQL, which, to my understanding, contains just a series of bytes. However, I can't get it to play well with nulls. For example:

=# select length(E'aa\x00aa'::bytea);
 length
--------
      2
(1 row)

I was expecting 5. Also:

=# select md5(E'aa\x00aa'::bytea);
               md5
----------------------------------
 4124bc0a9335c27f086f24ba207a4912
(1 row)

That's the MD5 of "aa", not "aa\x00aa". Clearly, I'm Doing It Wrong, but I don't know what I'm doing wrong. I'm also on an older version of Postgres (8.1.11) for reasons outside of my control. (I'll see if this behaves the same on the latest Postgres as soon as I get home...)

+3  A: 

Try this:

# select length(E'aa\\000aa'::bytea);
 length
--------
      5

Updated: Why the original didn't work? First, understand the difference between one slash and two:

pg=# select E'aa\055aa', length(E'aa\055aa') ;
 ?column? | length
----------+--------
 aa-aa    |      5
(1 row)

pg=# select E'aa\\055aa', length(E'aa\\055aa') ;
 ?column? | length
----------+--------
 aa\055aa |      8

In the first case, I'm writing a literal string, 4 characters unescaped('a') and one escaped. The slash is consumed by the parser in a first pass, which converts the full \055 to a single char ('-' in this case).

In the second case, the first slash just escapes the second, the pair \\ is translated by the parser to a single \ and the 055 is seen as three characters.

Now, when converting a text to a bytea, escape characters (in a already parsed or produced text) are parsed/interpreted again! (Yes, this is confusing).

So, when I write

 select E'aa\000aa'::bytea;

in the first parsing, the literal E'aa\000aa' is converted to an internal text with a null character in the third position (and depending on your postgresql version, the null character is interpreted as an EOS, and the text is assumed to be of length two - or in other versions an illegal string error is thrown).

Instead, when I write

 select E'aa\\000aa'::bytea;

in the first parsing, the literal string "aa\000aa" (eight characters) is seen, and is asigned to a text; then in the casting to bytea, it is parsed again, and the sequence of characters '\000' is interpreted as a null byte.

IMO postgresql kind of sucks here.

leonbloy
Ok. Now why does this work?
Thanatos
Ah - I think I see: Postgres strings can't represent binary data, so they use an escaped form, where a null is \000, which we must type as '\\000'. I see now the PQescapeBytea and PQunescapeBytea - I'm assuming that if I select a bytea column, I'll get this intermediate escaped form back, which I must then pass through PQunescapeBytea. This all makes more sense now.
Thanatos
Yes, thanks for your update. That confirms what I was thinking, and where you marked "(Yes, this is confusing.)" I wholeheartedly agree!
Thanatos
The explanatory update makes this a great answer.
araqnid
Sometimes it's simpler to specify base64 input for a binary literal, e.g. `decode('AAAA','base64')` instead of `E'\\000\\000\\000'::bytea`. 9.0 supports a simpler hex encoding for bytea values which is rather clearer (E'\\x000000' and no slavish ::bytea suffix needed)
araqnid