(All this was done with PostgreSQL 9.0.1 on Ubuntu 10.04, FWIW.)
I've tried writing my own lexize function for PostgreSQL specifically for names, based on the double metaphone algorithm. To produce the C lexize function, I started with the dict_xsyn example, and basically swapped double metaphone in for the synonym lookup.
But maybe 20% of the time to_tsvector is invoked, it seems to be dropping lexemes. As an example, I created a table containing the US Census Bureau's list of names.
db=# select * from names order by rank limit 8;
   name   | freq  | cumfreq | rank
----------+-------+---------+------
 SMITH    | 1.006 |   1.006 |    1
 JOHNSON  |  0.81 |   1.816 |    2
 WILLIAMS | 0.699 |   2.515 |    3
 JONES    | 0.621 |   3.136 |    4
 BROWN    | 0.621 |   3.757 |    5
 DAVIS    |  0.48 |   4.237 |    6
 MILLER   | 0.424 |    4.66 |    7
 WILSON   | 0.339 |       5 |    8
(8 rows)
We can then add a tsvector column and populate it by calling to_tsvector with the text search configuration that uses my metaphone dictionary:
db=# alter table names add column vec tsvector;
ALTER TABLE
db=# update names set vec=to_tsvector('public.names', name);
UPDATE 88799
db=# select * from names order by rank limit 8;
   name   | freq  | cumfreq | rank |              vec
----------+-------+---------+------+-------------------------------
 SMITH    | 1.006 |   1.006 |    1 |
 JOHNSON  |  0.81 |   1.816 |    2 | 'ANSN':1 'JNSN':1 'johnson':1
 WILLIAMS | 0.699 |   2.515 |    3 |
 JONES    | 0.621 |   3.136 |    4 |
 BROWN    | 0.621 |   3.757 |    5 |
 DAVIS    |  0.48 |   4.237 |    6 |
 MILLER   | 0.424 |    4.66 |    7 | 'MLR':1 'miller':1
 WILSON   | 0.339 |       5 |    8 | 'ALSN':1 'FLSN':1 'wilson':1
(8 rows)
A bunch of the vec fields are just empty! In fact:
db=# select count(1) from names where vec = to_tsvector('');
count
-------
41101
(1 row)
If I repeat that bulk update and recount the empty vectors, I get a different count each time, because a different set of rows ends up without a real vector on each run.
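One way to see the nondeterminism without rewriting the whole table is to evaluate to_tsvector several times per name and count how often the result comes back empty. This is only a sketch: it assumes the names table and the public.names configuration shown above, and the aliases g, i, attempts, and empty_results are made up for illustration.

```sql
-- Evaluate to_tsvector five times for each of the eight most common names
-- and count how many of those evaluations produce an empty vector.
-- length(tsvector) returns the number of lexemes in the vector.
select n.name,
       count(*) as attempts,
       sum(case when length(to_tsvector('public.names', n.name)) = 0
                then 1 else 0 end) as empty_results
from names n
cross join generate_series(1, 5) as g(i)
where n.rank <= 8
group by n.name, n.rank
order by n.rank;
```

If empty_results differs between rows, or between runs of the same query, the same input is producing different vectors, which would point at the dictionary code rather than at the data.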
But I think my lexize function itself is ok?
db=# alter table names drop column vec; alter table names add column lexemes varchar[];
ALTER TABLE
ALTER TABLE
db=# update names set lexemes=ts_lexize('dmetaphonedict', name);
UPDATE 88799
db=# select * from names order by rank limit 10;
   name   | freq  | cumfreq | rank |       lexemes
----------+-------+---------+------+----------------------
 SMITH    | 1.006 |   1.006 |    1 | {smith,SM0,XMT}
 JOHNSON  |  0.81 |   1.816 |    2 | {johnson,JNSN,ANSN}
 WILLIAMS | 0.699 |   2.515 |    3 | {williams,ALMS,FLMS}
 JONES    | 0.621 |   3.136 |    4 | {jones,JNS,ANS}
 BROWN    | 0.621 |   3.757 |    5 | {brown,PRN}
 DAVIS    |  0.48 |   4.237 |    6 | {davis,TFS}
 MILLER   | 0.424 |    4.66 |    7 | {miller,MLR}
 WILSON   | 0.339 |       5 |    8 | {wilson,ALSN,FLSN}
 MOORE    | 0.312 |   5.312 |    9 | {moore,MR}
 TAYLOR   | 0.311 |   5.623 |   10 | {taylor,TLR}
(10 rows)
db=# select count(1) from names where array_length(lexemes,1)=0;
count
-------
0
(1 row)
I can do that over and over and get 0 mismatches each time.
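One caveat about that check: in PostgreSQL, array_length returns NULL rather than 0 for an empty array, so the comparison with 0 can never match. A stricter version of the same check, sketched against the lexemes column created above, might look like this:

```sql
-- array_length() yields NULL, not 0, for an empty array.
select array_length('{}'::varchar[], 1) is null as empty_gives_null;

-- Count rows where ts_lexize returned NULL or an empty array.
select count(1)
from names
where lexemes is null
   or array_length(lexemes, 1) is null;
```

Even if this also returns 0, it only shows that the lexeme strings are present. As far as I can tell, ts_lexize reports just the strings, so it would not reveal a problem in any other per-lexeme data that to_tsvector reads.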
I'm creating the relevant text search dictionaries and configurations with:
CREATE OR REPLACE FUNCTION ddmetaphonedict_init(internal)
RETURNS internal
AS '$libdir/dict_dmetaphone'
LANGUAGE C STRICT;
CREATE OR REPLACE FUNCTION ddmetaphonedict_lexize(internal, internal, internal, internal)
RETURNS internal
AS '$libdir/dict_dmetaphone'
LANGUAGE C STRICT;
CREATE TEXT SEARCH TEMPLATE dmetaphonedict_template (
LEXIZE = ddmetaphonedict_lexize,
INIT = ddmetaphonedict_init
);
CREATE TEXT SEARCH DICTIONARY dmetaphonedict (
TEMPLATE = dmetaphonedict_template
);
COMMENT ON TEXT SEARCH DICTIONARY dmetaphonedict IS 'dictionary for names, using dmetaphone';
create text search configuration names (copy=english);
alter text search configuration names alter mapping for asciiword, asciihword, word, hword with dmetaphonedict;
Pretty much straight out of the dict_xsyn example.
What's going on? What can I do to figure out where these things are being dropped?
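A possible starting point for narrowing this down is the built-in ts_debug function, which reports, for each token, which dictionaries were consulted and which lexemes came back. The sketch below assumes the public.names configuration from above and uses 'SMITH' only because it is one of the rows that came out empty.

```sql
-- Show how the configuration handles a single name: the token type,
-- the dictionaries mapped to it, the dictionary that recognized it,
-- and the lexemes that dictionary returned.
select alias, token, dictionaries, dictionary, lexemes
from ts_debug('public.names', 'SMITH');

-- Check the parser stage alone, to rule out the token being lost
-- before any dictionary is consulted.
select * from ts_parse('default', 'SMITH');
```

Running the ts_debug query several times for the same name and comparing the dictionary and lexemes columns should show whether the token reaches dmetaphonedict every time and whether its output is stable.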