views: 137

answers: 3

I'm asking about the new labs feature "Google Scribe." Here is the link: http://scribe.googlelabs.com/

I am interested in the backend and the frontend, but mainly the backend. I want to build something similar with a very specific data set (derived from my own documents). I think the frontend of it is fairly straightforward, and I could probably even use an existing autocomplete plugin for the task.

+2  A: 

Possible implementation suggestion:

Backend: Build and maintain an N×N×W sparse array A (e.g. implemented as a hash) where N is the size of your vocabulary and W is the maximum context, in words, you wish to maintain (W=4 might be enough). Go over some sample data to seed/initialize A such that A[n1, n2, w] counts the number of times word n2 appeared at the w-th position after word n1 (respecting sentence boundaries).
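
A minimal sketch of that seeding step, assuming Python and a plain nested hash for the sparse structure (the names `A`, `W`, and `train` are illustrative, not anything Google Scribe is known to use):

```python
from collections import defaultdict
import re

W = 4  # maximum context distance, in words

# A[(n1, w)][n2] counts how often word n2 appeared w positions after n1.
A = defaultdict(lambda: defaultdict(int))

def train(text):
    # Respect sentence boundaries: count pairs within one sentence only.
    for sentence in re.split(r"[.!?]", text):
        words = sentence.lower().split()
        for i, n1 in enumerate(words):
            for w in range(1, W + 1):
                if i + w < len(words):
                    A[(n1, w)][words[i + w]] += 1
```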

Frontend: As the user types, ask the backend to use A to evaluate (and rank) the most probable successor words, based on the last W words the user has completely typed out in the current sentence; show only those suggestions which start with the user's current (partial) word, i.e. the word the user is in the process of typing.
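
And a companion sketch for the query side, continuing the assumptions above (`suggest` is a hypothetical helper that reuses `A` and `W` from the previous snippet):

```python
from collections import defaultdict  # continues the sketch above (reuses A, W)

def suggest(completed_words, partial, limit=5):
    # Score every known successor of the last W completed words, weighting
    # each candidate by the counts collected in A, then keep only candidates
    # that start with the partial word being typed.
    scores = defaultdict(int)
    for w, n1 in enumerate(reversed(completed_words[-W:]), start=1):
        for n2, count in A.get((n1, w), {}).items():
            if n2.startswith(partial):
                scores[n2] += count
    return sorted(scores, key=scores.get, reverse=True)[:limit]

# e.g. after train()-ing on a corpus:
#   suggest(["this", "is", "a"], "t")  ->  ranked candidates starting with "t"
```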

Optionally have the backend update A based on the words the user has completed typing, either on the fly (tricky when the user goes back to make corrections), as the finalized text is submitted (easiest), or via a periodic job that processes the texts submitted since the last run.
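
In the sketches above, the easiest of these variants amounts to simply calling `train` again on each newly submitted text, since the counts in `A` accumulate incrementally.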

Cheers, V.

vladr
As `aleemb` correctly points out, the formal term for the above is an order-`W` Markov chain.
vladr
+1  A: 

(I am not positive about this, so please correct me if I am wrong.)

The system that Google Scribe uses (or at least a very similar one) would essentially use a tree-like data structure to store all possible words, plus some form of search algorithm that finds all the possible ways you could finish your word based on the known vocabulary (probably based off of older search queries stored in their database) and orders them by how commonly they occur.

For instance:

I type: 'a'

Vocab: 'at' 'apple' 'atrocious'

So: 'at' is used the most, 'apple' second most, and 'atrocious' the least.
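
As a rough sketch of that idea, here is a small frequency-counting trie in Python; the structure and names are assumptions for illustration, since the real system's internals are not public:

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.count = 0  # how often a word ending at this node was seen

root = TrieNode()

def add_word(word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.count += 1

def complete(prefix):
    # Walk down to the prefix node, then collect completions by frequency.
    node = root
    for ch in prefix:
        if ch not in node.children:
            return []
        node = node.children[ch]
    results = []
    def collect(n, suffix):
        if n.count:
            results.append((prefix + suffix, n.count))
        for ch, child in n.children.items():
            collect(child, suffix + ch)
    collect(node, "")
    return [w for w, _ in sorted(results, key=lambda p: -p[1])]

# Mirroring the example above:
for w in ["at", "at", "at", "apple", "apple", "atrocious"]:
    add_word(w)
print(complete("a"))  # ['at', 'apple', 'atrocious']
```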

Like I said, I'm not sure this is the system they use, but it should produce similar results.

To estimate how likely each completion is, you could scan the documents you're searching, or simply accumulate counts query by query from your own past searches.

TaslemGuy
+2  A: 

You need to use Markov Chains.

You may want to start by looking here. The sample output is interesting too.
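
For a feel of the idea, here is a minimal first-order (bigram) Markov chain in Python; it is only a sketch, and a real suggester like the one described in the answer above would use a higher order and a much larger corpus:

```python
from collections import defaultdict, Counter

# transitions[prev][next] counts how often `next` follows `prev`.
transitions = defaultdict(Counter)

def train(text):
    words = text.lower().split()
    for prev, nxt in zip(words, words[1:]):
        transitions[prev][nxt] += 1

def predict(word, limit=3):
    # Most likely successors of `word`, ranked by observed frequency.
    return [w for w, _ in transitions[word].most_common(limit)]

train("the cat sat on the mat and the cat slept")
print(predict("the"))  # ['cat', 'mat']
```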

aleemb