Unless you have many millions of words, storing just their prefix seems like a bad plan.
For adding new data into the table, you can simply write a temporary table full of incoming words, then aggregate it and merge it into the main table in one go at the end of an import run. That is, something like:
BEGIN;
CREATE TEMP TABLE word_stage(word text) ON COMMIT DROP;
COPY word_stage FROM stdin;
-- use PQputCopyData to send all the words to the db...
SET LOCAL work_mem = '256MB'; -- use lots of memory for this aggregate..
CREATE TEMP TABLE word_count_stage AS
SELECT word, count(*) as occurrences
FROM word_stage
GROUP BY word;
-- word should be unique, check that and maybe use this index for merging
ALTER TABLE word_count_stage ADD PRIMARY KEY(word);
-- the UPDATE/INSERT pair below is not safe against concurrent writers, so take a lock
LOCK TABLE word_count IN SHARE ROW EXCLUSIVE MODE;
-- now update the existing words in the main table
UPDATE word_count
SET occurrences = word_count.occurrences + word_count_stage.occurrences,
    min_occurrences = least(word_count.min_occurrences, word_count_stage.occurrences),
    max_occurrences = greatest(word_count.max_occurrences, word_count_stage.occurrences)
FROM word_count_stage
WHERE word_count_stage.word = word_count.word;
-- and add the new words, if any
INSERT INTO word_count(word, occurrences, min_occurrences, max_occurrences)
SELECT word, occurrences, occurrences, occurrences
FROM word_count_stage
WHERE NOT EXISTS (SELECT 1 FROM word_count WHERE word_count.word = word_count_stage.word);
END;
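Feeding the COPY from client code is mostly a matter of producing COPY's default text format correctly. As a sketch (the payload builder below is illustrative; the escaping covers the characters that format treats specially):

```python
import io

def copy_payload(words):
    """Build a COPY text-format payload for a one-column table.

    Backslash, tab, newline and carriage return are the characters
    COPY's default text format treats specially, so escape them."""
    def esc(w):
        return (w.replace("\\", "\\\\")
                 .replace("\t", "\\t")
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))
    return io.StringIO("".join(esc(w) + "\n" for w in words))

# With a live connection, one way to send this is psycopg2's copy_expert,
# e.g.: cur.copy_expert("COPY word_stage FROM STDIN", copy_payload(words))
```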
So this aggregates a batch's worth of words and then applies them to the word count table. Having indexes on word_count_stage(word) and word_count(word) opens up possibilities such as a merge join if both tables are large, which you couldn't easily get by updating each row in the main table one at a time. It also cuts down on the amount of dead-row garbage generated in word_count. (That said, specifying a low fillfactor on word_count, 60 or so, would be a good idea, since you know it's still going to get somewhat thrashed by updates.)
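The merge semantics can be modelled in Python with word_count as a dict; this is just a sketch of what the UPDATE/INSERT pair does, assuming min_occurrences/max_occurrences are meant to track the smallest and largest per-batch counts:

```python
def merge_batch(word_count, batch_words):
    """word_count maps word -> (occurrences, min_occurrences, max_occurrences)."""
    # aggregate the batch, as the GROUP BY over word_count_stage does
    stage = {}
    for w in batch_words:
        stage[w] = stage.get(w, 0) + 1
    for word, n in stage.items():
        if word in word_count:      # the UPDATE arm
            occ, lo, hi = word_count[word]
            word_count[word] = (occ + n, min(lo, n), max(hi, n))
        else:                       # the INSERT arm
            word_count[word] = (n, n, n)
    return word_count
```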
If your input is actually word/occurrences pairs rather than bare words (your text isn't very clear on this), you can drop the initial word_stage table and COPY straight into word_count_stage, or you may still need an initial table and want to copy DISTINCT values from it into word_count_stage.
Seriously, I'd try using the entire word as a key, at least initially: the numbers you're quoting are well within the bounds of usability. Also note that the loading approach outlined above can easily be modified to truncate a word to its first two characters (or to transform it into a memory key in any arbitrary fashion), either by applying the transform as the data is moved into word_count_stage, or right at the end by putting the transform into the UPDATE/INSERT statements (although that way you would potentially lose the benefit of having an index on the temp table).
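As a sketch of the transform-at-staging variant, here is the two-character truncation applied to already-aggregated stage counts (names are illustrative; in SQL this would be a substr() in the query that fills word_count_stage):

```python
def truncate_stage(stage_counts, prefix_len=2):
    """Collapse staged word counts onto truncated keys, re-aggregating
    words that now share the same prefix."""
    out = {}
    for word, n in stage_counts.items():
        key = word[:prefix_len]
        out[key] = out.get(key, 0) + n
    return out
```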