views:

61

answers:

3

Hi,

Given document-D1: containing words (w1,w2,w3) and document D2 and words (w2,w3..) and document Dn and words ( w1,w2, wn)

Can I structure my data in big table to answer the questions like: which words occur most frequently with w1, or which words occur most frequently with w1 and w2.

What I am trying to achieve is to find the third word Wx (suggestion) which ocures most frequently in documents togehter with given words W1 and W2

I know the solution in SQL, but is it possible with google-big table?

I know I would have to build my indices by myself, the question is how should I structure them to avoid index explosion

thanks almir

A: 

There is nothing inherent to the AppEngine datastore that will help you with this problem. You will need to index the words in the documents programatically.

Adam Crossland
A: 

Using list-properties and merge-join is the best way to answer set membership questions in Google App Engine: Building Scalable, Complex Apps on App Engine.

You could setup your model as follows:

class Document(db.Model):
    word = db.StringListProperty()
    name = db.StringProperty()

...

doc.word = ["google", "app", "engine"]

Then it would be easy to query for co-occurrence. For example, which documents have the words google and engine?

results = db.GqlQuery(
"SELECT * FROM Documents "
"WHERE word = 'google'"
"  and word = 'engine'")

docs = [d.name for d in results]

There are some limitations, though. From the presentation:

Index writes are done in parallel on Bigtable Fast-- e.g., update a list property of 1000 items with 1000 row writes simultaneously! Scales linearly with number of items Limited to 5000 indexed properties per entity

But queries must unpackage all result entities When list size > ~100, reads are too expensive! Slow in wall-clock time Costs too much CPU

You could also create a model of words and save in the StringListProperty only their keys, but depending on the size of your documents even that would not be feasible.

jbochi
thanks I forgot to mention that I am looking for third word which does not appear in query but is frequently found with words W1 and W2, I adjusted my question
zebra
`and word = 'W3'` would do the trick, but do you need to do this online? I think it's a better idea to do this kind of processing offline, in memory.
jbochi
but I need 'W3' as a result of query, not as query input, it should "suggest" the words co-occurring with other two
zebra
Why is this marked as the answer? It answers a different question to the one that was asked.
Nick Johnson
+1  A: 

The only way to do this that I'm aware of is to index all 3-tuples of words, with their counts. Your kind would look something like this:

class Tuple(db.Model):
  words = db.StringListProperty()
  count = db.IntegerProperty()

Then, you need to insert or update the appropriate tuple entity for each set of 3 unique words in your text. Eg, the string "the king is dead" would result in the tuples (the, king, is), (the, king, dead), (the, is, dead), (king, is, dead)... This obviously results in an exponential explosion in entries, but I'm not aware of any way around that for what you want to do.

To find the suggestions, you'd do something like this:

q = Tuple.all().filter('word =', w1).filter('word =', w2).order('-count')

In the broader sense of recommendation algorithms, however, there is a lot of research into more efficient ways to do this. It's an open question, as evidenced by the existence of the Netflix challenge.

Nick Johnson