If I were writing a piece of software that attempted to predict the word a user was going to type next, based on the two previous words the user had typed, I would create two tables.

Like so:

== 1-gram table ==
Token | NextWord | Frequency
------+----------+-----------
"I"   | "like"   | 15
"I"   | "hate"   | 20

== 2-gram table ==
Token    | NextWord   | Frequency
---------+------------+-----------
"I like" | "apples"   | 8
"I like" | "tomatoes" | 12
"I hate" | "tomatoes" | 20
"I hate" | "apples"   | 2

Following this example implementation, the user types "I" and the software, using the above database, predicts that the next word will be "hate". If the user does type "hate", the software then predicts that the next word will be "tomatoes".
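As a rough sketch of the lookup just described, here is what that prediction step might look like in Python, with the two tables stood in for by dicts (all names are illustrative):

one_gram = {"I": {"like": 15, "hate": 20}}
two_gram = {"I like": {"apples": 8, "tomatoes": 12},
            "I hate": {"tomatoes": 20, "apples": 2}}

def predict(context):
    # Pick the table by how many words the context contains.
    table = two_gram if " " in context else one_gram
    candidates = table.get(context, {})
    # Highest-frequency next word, or None if the context is unseen.
    return max(candidates, key=candidates.get) if candidates else None

print(predict("I"))       # -> "hate" (frequency 20 beats 15)
print(predict("I hate"))  # -> "tomatoes" (frequency 20 beats 2)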

However, this implementation would require a table for each additional n-gram that I chose to take into account. If I decided to take the 5 or 6 preceding words into account when predicting the next word, I would need 5 or 6 tables, and the space required would grow exponentially with each additional n-gram.

What would be the best way to represent this in only one or two tables, with no upper limit on the number of n-grams I can support?

+1  A: 

You can actually just leave it the way you have it and use only one table. A two-gram cannot be equal to a one-gram, because the two-gram will have a space in it. Similarly, no three-gram will be equal to any two-gram because the three-gram will have two spaces. Ad infinitum.

So you can put all the 1-grams, 2-grams, etc into the Token field and none will ever collide.
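A quick illustration of that no-collision property (plain string reasoning; the snippet is mine, not the answerer's):

# An n-gram key contains exactly n-1 spaces, so keys from different
# n-gram orders can never collide with each other.
assert "I".count(" ") == 0           # 1-gram
assert "I like".count(" ") == 1      # 2-gram
assert "I like red".count(" ") == 2  # 3-gram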

danben
+2  A: 

Why not just store them all in the one table?

Token    | NextWord   | Frequency
---------+------------+-----------
"I"      | "like"     | 15
"I"      | "hate"     | 20
"I like" | "apples"   | 8
"I like" | "tomatoes" | 12
"I hate" | "tomatoes" | 20
"I hate" | "apples"   | 2

It'd then be up to your software to decide what you pass in for 'Token', and also when you insert new values (e.g. don't insert a partially typed word). If you want to get tricky, you can have an extra column for the number of words, but I don't think that would actually be required, since the number of spaces plus one gives the number of words.
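A minimal runnable sketch of this single-table design, using Python's sqlite3 (the schema and function names are my own, for illustration):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE ngrams (
                    token     TEXT,
                    next_word TEXT,
                    frequency INTEGER,
                    PRIMARY KEY (token, next_word))""")
rows = [("I", "like", 15), ("I", "hate", 20),
        ("I like", "apples", 8), ("I like", "tomatoes", 12),
        ("I hate", "tomatoes", 20), ("I hate", "apples", 2)]
conn.executemany("INSERT INTO ngrams VALUES (?, ?, ?)", rows)

def predict(context):
    # The same query works for any n-gram length; only the context
    # string passed in changes.
    row = conn.execute(
        "SELECT next_word FROM ngrams WHERE token = ? "
        "ORDER BY frequency DESC LIMIT 1", (context,)).fetchone()
    return row[0] if row else None

print(predict("I hate"))  # -> tomatoes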

Dean Harding
+2  A: 

Try a two-column table:

phrase, frequency

One optimisation would be to "normalise" some words in the phrase, e.g. "isn't" to "is not".

A second optimisation would be to use an MD5, CRC32 or similar hash of the phrase as the key.

James Anderson
+1 for the hash. If you're using a large corpus or long strings, a hash will be *essential* for adequate performance. You'll still need to keep the original string because of the inevitable collisions, alas.
egrunin
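
A sketch of the hashed-key idea combined with the collision caveat above, in Python (the row layout and helper names are assumptions for illustration):

import hashlib

def phrase_key(phrase):
    # 64-bit integer key taken from an MD5 digest of the phrase.
    digest = hashlib.md5(phrase.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

# Hypothetical row layout: (key, phrase, frequency). The original phrase
# is kept alongside the key so that two phrases hashing to the same key
# can still be told apart on lookup.
row = (phrase_key("I hate tomatoes"), "I hate tomatoes", 20)

def lookup(rows, phrase):
    # Filter by the fast integer key first, then confirm against the
    # stored phrase to rule out hash collisions.
    key = phrase_key(phrase)
    for k, p, freq in rows:
        if k == key and p == phrase:
            return freq
    return None

print(lookup([row], "I hate tomatoes"))  # -> 20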