If I were writing a piece of software that attempted to predict the word a user was going to type next, based on the two previous words the user had typed, I would create two tables.

Like so:

== 1-gram table ==
Token | NextWord | Frequency
------+----------+-----------
"I"   | "like"   | 15
"I"   | "hate"   | 20

== 2-gram table ==
Token    | NextWord   | Frequency
---------+------------+-----------
"I like" | "apples"   | 8
"I like" | "tomatoes" | 12
"I hate" | "tomatoes" | 20
"I hate" | "apples"   | 2

Following this example implementation, the user types "I" and the software, using the above database, predicts that the next word will be "hate". If the user does type "hate", the software then predicts that the next word will be "tomatoes".
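As a rough sketch of the lookup just described, here is what that prediction step might look like in Python, with the two tables stood in for by dicts (all names are illustrative):

one_gram = {"I": {"like": 15, "hate": 20}}
two_gram = {"I like": {"apples": 8, "tomatoes": 12},
            "I hate": {"tomatoes": 20, "apples": 2}}

def predict(context):
    # Pick the table by how many words the context contains.
    table = two_gram if " " in context else one_gram
    candidates = table.get(context, {})
    # Highest-frequency next word, or None if the context is unseen.
    return max(candidates, key=candidates.get) if candidates else None

print(predict("I"))       # -> "hate" (frequency 20 beats 15)
print(predict("I hate"))  # -> "tomatoes" (frequency 20 beats 2)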

However, this implementation would require a table for each additional n-gram that I chose to take into account. If I decided to take the 5 or 6 preceding words into account when predicting the next word, I would need 5 or 6 tables, and the space required would grow exponentially with each additional n-gram.

What would be the best way to represent this in only one or two tables, with no upper limit on the number of n-grams I can support?

+1  A: 

You can actually just leave it the way you have it and use only one table. A two-gram cannot be equal to a one-gram, because the two-gram will have a space in it. Similarly, no three-gram will be equal to any two-gram because the three-gram will have two spaces. Ad infinitum.

So you can put all the 1-grams, 2-grams, etc into the Token field and none will ever collide.
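A quick illustration of that no-collision property (plain string reasoning; the snippet is mine, not the answerer's):

# An n-gram key contains exactly n-1 spaces, so keys from different
# n-gram orders can never collide with each other.
assert "I".count(" ") == 0           # 1-gram
assert "I like".count(" ") == 1      # 2-gram
assert "I like red".count(" ") == 2  # 3-gram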

danben
+2  A: 

Why not just store them all in the one table?

Token    | NextWord   | Frequency
---------+------------+-----------
"I"      | "like"     | 15
"I"      | "hate"     | 20
"I like" | "apples"   | 8
"I like" | "tomatoes" | 12
"I hate" | "tomatoes" | 20
"I hate" | "apples"   | 2

It'd then be up to your software to decide what you pass in for 'Token', and also when you insert new values (e.g. don't insert a partially typed word). If you want to get tricky, you can have an extra column for the number of words, but I don't think that would actually be required, since the number of spaces plus one gives the number of words.
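A minimal runnable sketch of this single-table design, using Python's sqlite3 (the schema and function names are my own, for illustration):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE ngrams (
                    token     TEXT,
                    next_word TEXT,
                    frequency INTEGER,
                    PRIMARY KEY (token, next_word))""")
rows = [("I", "like", 15), ("I", "hate", 20),
        ("I like", "apples", 8), ("I like", "tomatoes", 12),
        ("I hate", "tomatoes", 20), ("I hate", "apples", 2)]
conn.executemany("INSERT INTO ngrams VALUES (?, ?, ?)", rows)

def predict(context):
    # The same query works for any n-gram length; only the context
    # string passed in changes.
    row = conn.execute(
        "SELECT next_word FROM ngrams WHERE token = ? "
        "ORDER BY frequency DESC LIMIT 1", (context,)).fetchone()
    return row[0] if row else None

print(predict("I hate"))  # -> tomatoes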

Dean Harding
+2  A: 

Try a two-column table:

phrase, frequency

One optimisation would be to "normalise" some words in the phrase, e.g. "isn't" to "is not".

A second optimisation would be to use an MD5, CRC32 or similar hash of the phrase as the key.

James Anderson
+1 for the hash. If you're using a large corpus or long strings, a hash will be *essential* for adequate performance. You'll still need to keep the original string because of the inevitable collisions, alas.
egrunin
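
A sketch of the hashed-key idea combined with the collision caveat above, in Python (the row layout and helper names are assumptions for illustration):

import hashlib

def phrase_key(phrase):
    # 64-bit integer key taken from an MD5 digest of the phrase.
    digest = hashlib.md5(phrase.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

# Hypothetical row layout: (key, phrase, frequency). The original phrase
# is kept alongside the key so that two phrases hashing to the same key
# can still be told apart on lookup.
row = (phrase_key("I hate tomatoes"), "I hate tomatoes", 20)

def lookup(rows, phrase):
    # Filter by the fast integer key first, then confirm against the
    # stored phrase to rule out hash collisions.
    key = phrase_key(phrase)
    for k, p, freq in rows:
        if k == key and p == phrase:
            return freq
    return None

print(lookup([row], "I hate tomatoes"))  # -> 20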