Hey,

I'm building a statistical research application. I need to store words bucketed by their first two letters (676 combinations), and for each word its number of occurrences (minimum, maximum, average) in a text. I'm not sure what the model/schema should look like. There will be a lot of checking whether a keyword has already been persisted. I appreciate your suggestions.


Edit: I'll be using either MySQL or PostgreSQL + Spring templates

+1  A: 

Unless you have many millions of words, storing just their prefix seems like a bad plan.

For adding new data into the table, you can simply write the incoming words to a temporary table and then aggregate and merge them into the main table in one go at the end of an import run. That is, something like:

BEGIN;
CREATE TEMP TABLE word_stage(word text) ON COMMIT DROP;
COPY word_stage FROM stdin;
-- use pgputcopydata to send all the words to the db...
SET LOCAL work_mem = '256MB'; -- use lots of memory for this aggregate
CREATE TEMP TABLE word_count_stage AS
    SELECT word, count(*) as occurrences
    FROM word_stage
    GROUP BY word;
-- word should be unique, check that and maybe use this index for merging
ALTER TABLE word_count_stage ADD PRIMARY KEY(word);
-- the UPDATE/INSERT pair below is not safe against concurrent writers,
-- so take a lock that blocks other modifications first
LOCK TABLE word_count IN SHARE ROW EXCLUSIVE MODE;
-- now update the existing words in the main table
UPDATE word_count
SET occurrences     = word_count.occurrences + word_count_stage.occurrences,
    min_occurrences = least(word_count.min_occurrences, word_count_stage.occurrences),
    max_occurrences = greatest(word_count.max_occurrences, word_count_stage.occurrences)
FROM word_count_stage
WHERE word_count_stage.word = word_count.word;
-- and add the new words, if any
INSERT INTO word_count(word, occurrences, min_occurrences, max_occurrences)
  SELECT word, occurrences, occurrences, occurrences
  FROM word_count_stage
  WHERE NOT EXISTS (SELECT 1 FROM word_count WHERE word_count.word = word_count_stage.word);
END;

So this aggregates a batch's worth of words and then applies them to the word count table. Having indices on word_stage(word) and word_count(word) opens up possibilities such as a merge join when both tables are large, which you couldn't easily get by updating each row in the main table one at a time. It also cuts down on the amount of dead-row garbage generated in word_count (although specifying a low fillfactor, 60 or so, on word_count would still be a good idea, since you know it's going to get somewhat thrashed by updates).
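To make the UPDATE/INSERT semantics concrete, here is a minimal in-memory sketch of the same merge in Java (class and field names are mine, not from the schema above): each batch's per-word count is added to the running total, while min/max track the smallest and largest per-batch counts seen so far.

```java
import java.util.HashMap;
import java.util.Map;

public class WordCountMerge {
    // Aggregated stats for one word, mirroring the word_count columns.
    static final class Stats {
        long occurrences, min, max;
        Stats(long occ) { occurrences = occ; min = occ; max = occ; }
    }

    // Merge one batch's per-word counts into the running totals,
    // the same way the UPDATE/INSERT pair does.
    static void merge(Map<String, Stats> main, Map<String, Long> batch) {
        for (Map.Entry<String, Long> e : batch.entrySet()) {
            Stats s = main.get(e.getKey());
            if (s == null) {
                main.put(e.getKey(), new Stats(e.getValue())); // INSERT branch
            } else {                                           // UPDATE branch
                s.occurrences += e.getValue();
                s.min = Math.min(s.min, e.getValue());
                s.max = Math.max(s.max, e.getValue());
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Stats> totals = new HashMap<>();
        merge(totals, Map.of("cat", 3L, "dog", 1L));
        merge(totals, Map.of("cat", 5L));
        Stats cat = totals.get("cat");
        System.out.println(cat.occurrences + " " + cat.min + " " + cat.max); // 8 3 5
    }
}
```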

If your input is actually word/occurrences pairs instead of just words (your text isn't very clear) then you can take out the initial word_stage table and just copy into word_count_stage, or maybe you need an initial table and want to copy DISTINCT values from that initial table to word_count_stage.

Seriously, I'd try using the entire word as a key, at least initially: the numbers you're quoting are well within the bounds of usability. Also note that the loading approach outlined above can easily be modified to truncate a word to its first two characters (or transform it into a key in any other fashion), either by transforming the data as it is moved into word_count_stage, or right at the end by putting the transform into the UPDATE/INSERT statements (although you would lose the benefit of the index on the temp table that way).
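If you do decide to bucket by the first two letters, the transform itself is trivial; a hypothetical Java helper (the name and the lowercasing behavior are my assumptions) might look like:

```java
public class PrefixKey {
    // Hypothetical helper: reduce a word to its two-letter bucket key
    // (the 26 * 26 = 676 combinations the question mentions).
    static String prefix2(String word) {
        String w = word.toLowerCase();
        return w.length() >= 2 ? w.substring(0, 2) : w;
    }

    public static void main(String[] args) {
        System.out.println(prefix2("Statistics")); // st
    }
}
```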

araqnid
That's a great solution, I'm going to do it this way right now. Thank you
lisak
@lisak btw, since you imply you'll be using Java: I briefly mentioned "use pgputcopydata..." to put the data into word_stage. In Java terms, that translates to creating a PGCopyOutputStream on the connection, which you will have to do yourself since it's PostgreSQL-specific. You can of course populate the table using standard Spring stuff and INSERT statements, but using COPY should be faster.
araqnid
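For reference, a rough sketch of the COPY path from Java. The runnable part below just builds the text-format payload COPY expects (word_stage has a single text column, so no tab-separated fields or escaping are needed for plain words); the PGCopyOutputStream wiring is shown only in comments, since it needs the PostgreSQL JDBC driver on the classpath and a live connection.

```java
import java.util.List;

public class CopyPayload {
    // Build the text-format payload COPY expects:
    // one row per line, newline-terminated.
    static String toCopyText(List<String> words) {
        StringBuilder sb = new StringBuilder();
        for (String w : words) sb.append(w).append('\n');
        return sb.toString();
    }

    // Sending it with the PostgreSQL JDBC driver looks roughly like this
    // (sketch only, assuming an open java.sql.Connection conn):
    //
    //   PGConnection pg = conn.unwrap(PGConnection.class);
    //   PGCopyOutputStream out =
    //       new PGCopyOutputStream(pg, "COPY word_stage FROM STDIN");
    //   out.write(toCopyText(words).getBytes(StandardCharsets.UTF_8));
    //   out.close(); // flushes and ends the COPY

    public static void main(String[] args) {
        System.out.print(toCopyText(List.of("cat", "dog"))); // cat\ndog\n
    }
}
```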