ansaurus

Question

Answer 1

+1 A:

I found different ways to do this on the PostgreSQL Wiki.

In plperl:

CREATE OR REPLACE FUNCTION unaccent_string(text) RETURNS text AS $$
my ($input_string) = @_;
# Replace every accented letter with its plain counterpart.
# Each substitution needs a closing delimiter and /g to replace all occurrences.
$input_string =~ s/[âãäåāăą]/a/g;
$input_string =~ s/[ÁÂÃÄÅĀĂĄ]/A/g;
$input_string =~ s/[èééêëēĕėęě]/e/g;
$input_string =~ s/[ĒĔĖĘĚ]/E/g;
$input_string =~ s/[ìíîïìĩīĭ]/i/g;
$input_string =~ s/[ÌÍÎÏÌĨĪĬ]/I/g;
$input_string =~ s/[óôõöōŏő]/o/g;
$input_string =~ s/[ÒÓÔÕÖŌŎŐ]/O/g;
$input_string =~ s/[ùúûüũūŭů]/u/g;
$input_string =~ s/[ÙÚÛÜŨŪŬŮ]/U/g;
return $input_string;
$$ LANGUAGE plperl;

In pure SQL:

CREATE OR REPLACE FUNCTION unaccent_string(text)
RETURNS text
IMMUTABLE
STRICT
LANGUAGE SQL
AS $$
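-- Note: the replacement string is all lowercase, so uppercase accented
-- letters come back as lowercase (unlike the plperl version above).
SELECT translate(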
    $1,
    'âãäåāăąÁÂÃÄÅĀĂĄèééêëēĕėęěĒĔĖĘĚìíîïìĩīĭÌÍÎÏÌĨĪĬóôõöōŏőÒÓÔÕÖŌŎŐùúûüũūŭůÙÚÛÜŨŪŬŮ',
    'aaaaaaaaaaaaaaaeeeeeeeeeeeeeeeiiiiiiiiiiiiiiiiooooooooooooooouuuuuuuuuuuuuuuu'
);
$$;

And in plpython:

create or replace function unaccent(text) returns text language plpythonu as $$
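import unicodedata
# Look up the server encoding so the incoming byte string can be decoded.
rv = plpy.execute("select setting from pg_settings where name = 'server_encoding'")
encoding = rv[0]["setting"]
s = args[0].decode(encoding)
# NFKD splits accented letters into a base letter plus combining marks.
s = unicodedata.normalize("NFKD", s)
# Keep only ASCII, which drops the marks (and any other non-ASCII character).
s = ''.join(c for c in s if ord(c) < 127)
return s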
$$;

In your case, a translate() call that lists all the accented characters you need from the Unicode table should be enough.

analogue 2010-06-16 08:51:34
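
For trying the translate() idea outside the database, here is a minimal Python sketch of the same one-to-one mapping. The GROUPS table is an illustrative assumption and covers only a small subset of letters; the function is named unaccent_string only to mirror the SQL function above.

```python
# Sketch: one-to-one accent stripping, analogous to SQL translate().
# GROUPS is a hypothetical, deliberately incomplete table for illustration.
GROUPS = {
    "a": "àáâãäåāăą",
    "e": "èéêëēĕėęě",
    "i": "ìíîïĩīĭ",
    "o": "òóôõöōŏő",
    "u": "ùúûüũūŭů",
}

# Build a code point -> replacement table, covering both letter cases.
TABLE = {}
for plain, accented in GROUPS.items():
    for ch in accented:
        TABLE[ord(ch)] = plain
        TABLE[ord(ch.upper())] = plain.upper()


def unaccent_string(text):
    """Return text with the letters listed in GROUPS replaced by plain ones."""
    return text.translate(TABLE)


print(unaccent_string("Ärger über Ēriks"))
```

Grouping by target letter avoids the two long strings drifting out of step. That matters for the SQL version too: in PostgreSQL's translate(), characters in the "from" string that have no counterpart in the "to" string are deleted rather than replaced.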

Such functions (self-written) seem to be overkill...

binaryLV 2010-06-16 08:55:57

I think this pure SQL solution is the best if you only need a somewhat limited character set (for example, only Latin-based alphabets). It will be fast, is easy to understand, and does not need any non-standard modules. Just take all the characters from ISO-8859-1, ISO-8859-2 and ISO-8859-15; that should be enough.

Tometzky 2010-06-17 08:57:43

For now, the pure SQL function is being used, though it's not as good as I would like it to be. As for the character encodings mentioned: they are NOT enough, as they do not cover Latvian (I'm from Latvia) and probably many other languages too; at least ISO-8859-4 should be added. To make it more complete, I took all the letters with diacritics from http://en.wikipedia.org/wiki/Diacritic (I copied all the text to Notepad and removed all the "standard" characters using Ctrl+H), wrote the equivalent Latin letters, and then joined the result with the letters from the examples provided by analogue.

binaryLV 2010-06-17 09:16:56
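
As an alternative to collecting the letters by hand, the two translate() strings could be generated from Unicode decomposition data. The sketch below is only one possible approach, not what was actually used in this thread; build_translate_args and its default range (Latin-1 Supplement through Latin Extended-A) are assumptions chosen for illustration.

```python
import unicodedata


def build_translate_args(first=0x00C0, last=0x017F):
    """Return (from_chars, to_chars) for letters in the given code point range
    that decompose into an ASCII letter followed by combining marks."""
    src, dst = [], []
    for cp in range(first, last + 1):
        ch = chr(cp)
        decomposed = unicodedata.normalize("NFD", ch)
        base, marks = decomposed[0], decomposed[1:]
        is_ascii_letter = ord(base) < 128 and base.isalpha()
        # Require at least one mark, and only combining marks after the base.
        if is_ascii_letter and marks and all(unicodedata.combining(m) for m in marks):
            src.append(ch)
            dst.append(base)
    return "".join(src), "".join(dst)


from_chars, to_chars = build_translate_args()
# The two strings can then be pasted into the SQL function body.
print("translate($1, '%s', '%s')" % (from_chars, to_chars))
```

Letters that have no canonical decomposition, such as ł or ø, are skipped by this check and would still need to be added manually.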

Answer 2

+1 A:

Use pg_collkey() for ICU-supported Unicode comparison:

- http://www.public-software-group.org/pg_collkey
- http://russ.garrett.co.uk/tag/postgresql/

J-16 SDiZ 2010-06-16 08:52:38

Can't try it right now (I don't know how to compile it on Windows). And from what I read, I'm not sure if pg_collkey would work for us, because we need to use some simple patterns, e.g., string `%mazs%rukitis%` would have to match `mazs glāžšķūņu rūķītis`.

binaryLV 2010-06-16 13:05:12
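
The pattern requirement in the comment above can be checked outside the database with a short Python sketch. It assumes the approach from Answer 1: strip the accents first, then apply an ordinary LIKE-style pattern. The helper names unaccent and like are hypothetical, and like only imitates the % and _ wildcards of SQL LIKE (no escape character).

```python
import re
import unicodedata


def unaccent(text):
    """Decompose the text and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def like(value, pattern):
    """Rough imitation of SQL LIKE: % matches any run, _ matches one character."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.fullmatch("".join(parts), value, re.DOTALL) is not None


name = "mazs glāžšķūņu rūķītis"
print(like(name, "%mazs%rukitis%"))            # raw text: no match
print(like(unaccent(name), "%mazs%rukitis%"))  # unaccented text: match
```

In the database, the equivalent would presumably be a condition such as unaccent_string(column) LIKE '%mazs%rukitis%', provided the function's character list covers the Latvian letters.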


comparing strings in PostgreSQL
