views:

35

answers:

2

I'm trying to get PostgreSQL 8.4.3 to do case insensitive pattern matching with its ~* operator when the strings contain non-ASCII characters like German umlauts. The database, terminal, and everything else is configured to use UTF-8.

Here's the problem in a nutshell:

SELECT 'Ö' ~* 'ö';      -- false

There are other variants which do work:

SELECT 'Ö' ILIKE 'ö';     -- true
SELECT 'Ö' ~* '[Öö]';     -- true
SELECT LOWER('Ö') ~* 'ö'; -- true

None of these alternatives make me especially happy. ILIKE doesn't use regular expressions. [Öö] involves rewriting the search term. LOWER() is probably the best workaround, but I'd really like to get the ~* operator working like it's supposed to.

Thanks in advance.

+1  A: 

I get true with this query:

SELECT 'Ö' ~* 'ö'; -- true

But I did use version 9.0beta2 at OS X 10.5.8 with these settings:

CREATE DATABASE test
  WITH OWNER = postgres
       ENCODING = 'UTF8'
       TABLESPACE = pg_default
       LC_COLLATE = 'nl_NL.UTF-8'
       LC_CTYPE = 'nl_NL.UTF-8'
       CONNECTION LIMIT = -1;

Edit: Same result on version 8.3.7. Looks like you have a problem with the encoding.

Frank Heikens
Very interesting, thanks for checking. I can't get this to work anywhere. My local db server (v8.3.9 on Ubuntu 8.10) won't let me specify LC_COLLATE or LC_CTYPE in the CREATE DATABASE statement. I think those were added in 8.4. Without them, the results in the new database are the same. The db on the server (v8.4.3 on Ubuntu 10.04) required me to specify TEMPLATE=template0 before I could create a database with the 'nl_NL.UTF-8' locales, but even so, the result was still the same (no match).
Zilk
Hmmm, I get a FALSE as well on Ubuntu 9.10, PostgreSQL 8.4.4. (VM running on OS X). I guess it has something to do with innitdb when the db-cluster is created. Take a look at it and/or send a message to the pgsql-mailinglist.
Frank Heikens
Got FALSE also with 8.4.3 on Fedora 12. (cluster init with LATIN9, but it let me create the database abose by specifying `TEMPLATE = template0;`
leonbloy
BTW, to check first that we are not messing with the terminal i/o, one should check that, for example: `SELECT encode('Ö','hex')` gives `c396` (with UTF8)
leonbloy
+2  A: 

This is a bug in PostgreSQL versions prior to 9.0.
It's in the 9.0 changelog: http://www.postgresql.org/docs/9.0/static/release-9-0.html#AEN99075

Here is my test in 9.0 beta2 using Ubuntu:

SELECT 'Ö' ~* 'ö';
 ?column? 
----------
 t
(1 row)
Diogo Biazus
Thanks, I missed that one.
Frank Heikens
Thank you, that explains it. Now I can at least stop looking for problems with our setup.
Zilk