ansaurus

Question

Co-occurrence of words in documents with Google big table

Answer 1

A:

There is nothing inherent to the AppEngine datastore that will help you with this problem. You will need to index the words in the documents programatically.

Adam Crossland 2010-01-06 21:36:40

Answer 2

A:

Using list-properties and merge-join is the best way to answer set membership questions in Google App Engine: Building Scalable, Complex Apps on App Engine.

You could setup your model as follows:

class Document(db.Model):
    word = db.StringListProperty()
    name = db.StringProperty()

...

doc.word = ["google", "app", "engine"]

Then it would be easy to query for co-occurrence. For example, which documents have the words google and engine?

results = db.GqlQuery(
"SELECT * FROM Documents "
"WHERE word = 'google'"
"  and word = 'engine'")

docs = [d.name for d in results]

There are some limitations, though. From the presentation:

Index writes are done in parallel on Bigtable Fast-- e.g., update a list property of 1000 items with 1000 row writes simultaneously! Scales linearly with number of items Limited to 5000 indexed properties per entity

But queries must unpackage all result entities When list size > ~100, reads are too expensive! Slow in wall-clock time Costs too much CPU

You could also create a model of words and save in the StringListProperty only their keys, but depending on the size of your documents even that would not be feasible.

jbochi 2010-01-07 09:52:05

thanks I forgot to mention that I am looking for third word which does not appear in query but is frequently found with words W1 and W2, I adjusted my question

zebra 2010-01-07 12:47:46

`and word = 'W3'` would do the trick, but do you need to do this online? I think it's a better idea to do this kind of processing offline, in memory.

jbochi 2010-01-08 01:28:20

but I need 'W3' as a result of query, not as query input, it should "suggest" the words co-occurring with other two

zebra 2010-01-08 11:31:22

Why is this marked as the answer? It answers a different question to the one that was asked.

Nick Johnson 2010-01-13 15:21:40

Answer 3

+1 A:

The only way to do this that I'm aware of is to index all 3-tuples of words, with their counts. Your kind would look something like this:

class Tuple(db.Model):
  words = db.StringListProperty()
  count = db.IntegerProperty()

Then, you need to insert or update the appropriate tuple entity for each set of 3 unique words in your text. Eg, the string "the king is dead" would result in the tuples (the, king, is), (the, king, dead), (the, is, dead), (king, is, dead)... This obviously results in an exponential explosion in entries, but I'm not aware of any way around that for what you want to do.

To find the suggestions, you'd do something like this:

q = Tuple.all().filter('word =', w1).filter('word =', w2).order('-count')

In the broader sense of recommendation algorithms, however, there is a lot of research into more efficient ways to do this. It's an open question, as evidenced by the existence of the Netflix challenge.

Nick Johnson 2010-01-13 15:21:16

ansaurus

tags:

views:

answers:

Co-occurrence of words in documents with Google big table

related questions