Hello!

Is there any way in PostgreSQL to convert UTF-8 characters to "similar" ASCII characters?

The string glāžšķūņu rūķīši would have to be converted to glazskunu rukisi. The UTF-8 text is not in one specific language; it might be in Latvian, Russian, English, Italian or any other language.

This is needed for use in a WHERE clause, so it could be just "comparing strings" rather than "converting strings".

I tried using convert, but it does not give the desired result (e.g., select convert('Ā', 'utf8', 'sql_ascii') gives \304\200, not A).
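The \304\200 result is simply the two UTF-8 bytes of 'Ā' printed as octal escapes, which suggests that convert only re-labels the bytes and does not transliterate anything. A small sketch in plain Python 3, outside the database, reproduces that output (the variable names are illustrative only):

```python
# 'Ā' is U+0100; its UTF-8 encoding is the two bytes 0xC4 0x80.
utf8_bytes = "Ā".encode("utf-8")

# Render each byte as a backslash followed by three octal digits,
# which is how the bytes appear in the convert() result quoted above.
octal_escape = "".join("\\%03o" % b for b in utf8_bytes)

print(octal_escape)  # \304\200
```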

The database was created with:

ENCODING = 'UTF8'
LC_COLLATE = 'Latvian_Latvia.1257'
LC_CTYPE = 'Latvian_Latvia.1257'

These parameters may be changed if necessary.

+1  A: 

I found different ways to do this on the PostgreSQL Wiki.

In plperl:

CREATE OR REPLACE FUNCTION unaccent_string(text) RETURNS text AS $$
my ($input_string) = @_;
# Replace each accented letter with its base letter (/g = every occurrence).
$input_string =~ s/[âãäåāăą]/a/g;
$input_string =~ s/[ÁÂÃÄÅĀĂĄ]/A/g;
$input_string =~ s/[èééêëēĕėęě]/e/g;
$input_string =~ s/[ĒĔĖĘĚ]/E/g;
$input_string =~ s/[ìíîïìĩīĭ]/i/g;
$input_string =~ s/[ÌÍÎÏÌĨĪĬ]/I/g;
$input_string =~ s/[óôõöōŏő]/o/g;
$input_string =~ s/[ÒÓÔÕÖŌŎŐ]/O/g;
$input_string =~ s/[ùúûüũūŭů]/u/g;
$input_string =~ s/[ÙÚÛÜŨŪŬŮ]/U/g;
return $input_string;
$$ LANGUAGE plperl;

In pure SQL:

CREATE OR REPLACE FUNCTION unaccent_string(text)
RETURNS text
IMMUTABLE
STRICT
LANGUAGE SQL
AS $$
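-- translate() maps the n-th character of the second argument to the n-th
-- character of the third, so both strings must have the same length.
-- Note that accented uppercase letters are mapped to lowercase here.
SELECT translate(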
    $1,
    'âãäåāăąÁÂÃÄÅĀĂĄèééêëēĕėęěĒĔĖĘĚìíîïìĩīĭÌÍÎÏÌĨĪĬóôõöōŏőÒÓÔÕÖŌŎŐùúûüũūŭůÙÚÛÜŨŪŬŮ',
    'aaaaaaaaaaaaaaaeeeeeeeeeeeeeeeiiiiiiiiiiiiiiiiooooooooooooooouuuuuuuuuuuuuuuu'
);
$$;

And in plpython:

create or replace function unaccent(text) returns text language plpythonu as $$
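import unicodedata
# Look up the server encoding to decode the incoming byte string (Python 2).
rv = plpy.execute("select setting from pg_settings where name = 'server_encoding'")
encoding = rv[0]["setting"]
s = args[0].decode(encoding)
# NFKD splits each accented letter into a base letter plus combining marks.
s = unicodedata.normalize("NFKD", s)
# Keep only ASCII characters (code points 0-127), which drops the marks.
s = ''.join(c for c in s if ord(c) < 128)
return s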
$$;

In your case, a translate() call listing all the accented characters you can find in the Unicode tables should be enough.
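To see what the normalization approach in the plpython function does, here is a minimal sketch of the same idea in plain Python 3, run outside the database. It is an illustration rather than a drop-in replacement: the function name is simply reused from the example above, and the decoding step is not needed because Python 3 strings are already Unicode.

```python
import unicodedata

def unaccent(text):
    """Strip diacritics by NFKD-decomposing and keeping only ASCII."""
    # NFKD turns e.g. 'ā' into 'a' followed by a combining macron.
    decomposed = unicodedata.normalize("NFKD", text)
    # Dropping everything outside ASCII removes the combining marks.
    return "".join(c for c in decomposed if ord(c) < 128)

print(unaccent("glāžšķūņu rūķīši"))  # glazskunu rukisi
```

Characters that have no decomposition, such as 'ø' or Cyrillic letters, are dropped entirely by this filter, so it only covers letters that consist of an ASCII base letter plus diacritics.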

analogue
Such functions (self-written) seem to be overkill...
binaryLV
I think this pure SQL solution is the best if you only need a somewhat limited character set (for example, only Latin-based alphabets). It will be fast, is easy to understand and does not need any non-standard modules. Just take all the characters from ISO-8859-1, ISO-8859-2 and ISO-8859-15; that should be enough.
Tometzky
For now, the pure SQL function is being used, though it's not as good as I would like it to be. As for the character encodings mentioned: they are NOT enough, as they do not cover Latvian (I'm from Latvia) and probably many other languages too; at least ISO-8859-4 should be added. To make it more complete, I took all the letters with diacritics from http://en.wikipedia.org/wiki/Diacritic (I copied all the text into Notepad and removed all the "standard" characters using Ctrl+H), wrote the equivalent Latin letters, and then joined the result with the letters from the examples provided by analogue.
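The two strings for translate() could also be generated instead of collected by hand. The following is a rough sketch in Python 3, not what was actually done here: the helper name build_translate_strings and the default range (Latin-1 Supplement through Latin Extended-A) are assumptions chosen for illustration. It keeps only letters that decompose into one ASCII letter plus combining marks, and it preserves case.

```python
import unicodedata

def build_translate_strings(first=0x00C0, last=0x017F):
    """Return (from_chars, to_chars) for use as translate() arguments."""
    from_chars, to_chars = [], []
    for code in range(first, last + 1):
        ch = chr(code)
        decomposed = unicodedata.normalize("NFKD", ch)
        base = decomposed[0]
        # Keep only: one ASCII letter followed solely by combining marks.
        if (len(decomposed) > 1
                and ord(base) < 128 and base.isalpha()
                and all(unicodedata.combining(c) for c in decomposed[1:])):
            from_chars.append(ch)
            to_chars.append(base)
    return "".join(from_chars), "".join(to_chars)

from_chars, to_chars = build_translate_strings()
print(from_chars)
print(to_chars)
```

Both strings have the same length by construction, so they can be pasted into translate() as the second and third arguments. Letters without a decomposition (for example 'ø' or 'ł') are skipped and would still have to be added manually.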
binaryLV
+1  A: 

Use pg_collkey() for ICU-supported Unicode comparison:

- http://www.public-software-group.org/pg_collkey
- http://russ.garrett.co.uk/tag/postgresql/

J-16 SDiZ
I can't try it right now (I don't know how to compile it on Windows). And from what I have read, I'm not sure pg_collkey would work for us, because we need to use some simple patterns; e.g., the pattern `%mazs%rukitis%` would have to match `mazs glāžšķūņu rūķītis`.
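To make the requirement concrete, here is a small Python 3 sketch of the intended behaviour: strip the diacritics from the stored value first, then apply the pattern. The helpers unaccent, like_to_regex and unaccented_like are hypothetical names for illustration, and the LIKE emulation handles only % and _ without escape characters.

```python
import re
import unicodedata

def unaccent(text):
    """Strip diacritics by NFKD-decomposing and keeping only ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if ord(c) < 128)

def like_to_regex(pattern):
    """Translate a LIKE pattern (% and _ only) into a compiled regex."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")   # any sequence of characters
        elif ch == "_":
            parts.append(".")    # exactly one character
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)

def unaccented_like(value, pattern):
    """True if the unaccented value matches the LIKE-style pattern."""
    return like_to_regex(pattern).fullmatch(unaccent(value)) is not None

print(unaccented_like("mazs glāžšķūņu rūķītis", "%mazs%rukitis%"))  # True
```

In the database, the equivalent would be to apply an unaccenting function to the column before the pattern comparison.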
binaryLV