views:

426

answers:

4

My problem is conceptually similar to solving anagrams, except I can't just use a dictionary lookup. I am trying to find plausible words rather than real words.

I have created an N-gram model (for now, N=2) based on the letters in a bunch of text. Now, given a random sequence of letters, I would like to permute them into the most likely sequence according to the transition probabilities. I thought I would need the Viterbi algorithm when I started this, but as I look deeper, the Viterbi algorithm optimizes a sequence of hidden random variables based on the observed output. I am trying to optimize the output sequence.

Is there a well-known algorithm for this that I can read about? Or am I on the right track with Viterbi and I'm just not seeing how to apply it?

Update

I have added a bounty to ask for more insight into this problem. (Analysis explaining why an efficient approach isn't possible, other heuristics/approximations besides simulated annealing, etc.)

+3  A: 

If I understand your problem correctly, you are searching all permutations of letters in a word for the one with the lowest product of 2-gram probabilities.

If your word is too long to simply brute force all combinations, I've found that stochastic optimization algorithms produce good results in a short time. I (having a mathematical background) have done some work on the algorithm "Simulated Annealing", which I think would fit nicely to your problem. And it is pretty easy to implement. =)

Jens
+3  A: 

Consider the set of K letters as vertices in a graph.

Add directed edges to represent the 2-grams from each letter to all the others, with weights that correspond to their probabilities.

A "word", then, is a path through the (complete, directed) graph.

You are looking for the best (least- or most-weighted) "word" (path) that uses all the letters (visits all the vertices).

This is the asymmetric Traveling Salesman Problem. It's NP-complete. I don't think it's going to get easier if you use N-grams bigger than N=2, so you're not likely to find an efficient algorithm, but let us know if you do ;)

(Simulated Annealing or something like it is probably the way to go)

Zac Thompson
+1  A: 

You could also do it stochastically with a Markov chain. For starters, make sure that your N-gram table includes a "beginning of word" symbol; then find the available transitions from that state, and filter them so that they only include available letters from your pool, and choose randomly among them using weighted probabilities. Then find the transitions from the next state, filtering down to the still-available letters, and end when there are no more letters in the pool (or, if you reach a state that you can't transition out of, go back to the beginning and try again).

You may actually find it useful that this is more random than some of the other available options, and if it's too random you have the option of massaging the probabilities, or simply generating some number n (say 100) of random words, sorting them by their "likelihood", and then choosing randomly from among the top m (perhaps 10), which gives you relatively fine control over whether the words you generate from any bag of letters are more consistent or more random.

hobbs
+3  A: 

As an exercise, I wrote a simple implementation of Markov Chains in MATLAB. Basically its a letter-based probabilistic model to generating words.

function mc = MC(numStates)
    N = numStates;
    PI = ones(1,N)/N;
    TR = ones(N,N)/N;
    mc = struct('logprob',@logprob, 'train',@train, ...
                'sample',@sample, 'sampleFiltered',@sampleFiltered);

    function train(seqs)
        PI = 1 + histc(cellfun(@(c)c(1), seqs)', 1:N); %#'
        TR = ones(N,N);
        for i=1:numel(seqs)
            ind = sub2ind([N N], seqs{i}(1:end-1), seqs{i}(2:end));
            TR = TR + reshape(histc(ind,1:N*N), [N N]);
        end
        PI = bsxfun(@rdivide, PI, sum(PI,2)+(sum(PI,2)==0));
        TR = bsxfun(@rdivide, TR, sum(TR,2)+(sum(TR,2)==0));
    end

    function seq = sample(len)
        seq = zeros(1,len);
        seq(1) = randsample(1:N, 1, true, PI);
        for t=2:len
            seq(t) = randsample(1:N, 1, true, TR(seq(t-1),:));
        end
    end

    function seq = sampleFiltered(allowed)
        len = numel(allowed);
        seq = zeros(1,len);
        seq(1) = randsample(allowed, 1, true, PI(allowed));
        allowed( find(allowed==seq(1),1,'first') ) = [];
        for t=2:len-1
            seq(t) = randsample(allowed, 1, true, TR(seq(t-1),allowed));
            allowed( find(allowed==seq(t),1,'first') ) = [];
        end
        seq(t) = allowed;
        seq = seq(seq~=0);
    end

    function LL = logprob(seq)
        LL = log(PI(seq(1))) + ...
             sum( log(TR(sub2ind([N N],seq(1:end-1),seq(2:end)))) );
    end
end

We will need some text to train the model. We use 'The Wonderful Wizard of Oz' from Project Gutenberg.

%# read the text document
str = lower( urlread('http://www.gutenberg.org/files/55/55.txt') );
SP = char(32);                        %# delimiter (space)
str( ~isstrprop(str, 'alpha') ) = SP; %# replace non-letters with spaces
str( findstr(str, [SP SP]) ) = [];    %# consecutive spaces as one
idx = ( str == SP );                  %# location of spaces
df = diff([1 idx 1]);
len = find(df > 0) - find(df < 0);    %# length of each word
[seqs gn] = grp2idx( str(~idx)' );    %#' map letters to numbers starting from 1
seqs = mat2cell(seqs', 1, len)';      %# put each word in a separate cell
N = length(gn);                       %# A to Z

Finally, we use the model to either sample random words or sample words from a set of letters:

%# train Markov chain
mc = MC(N);
mc.train(seqs);

%# sample a random word
seq = mc.sample( randi([3 10]) );
fprintf('word = %s , logP(word)=%f\n', [gn{seq}], mc.logprob(seq))

%# sample a word from a set of letters
letters = lower('markovchains');
lettersIdx = cellfun(@(c) find(strcmp(c,gn)), cellstr(letters'));   %#'
seq = mc.sampleFiltered(lettersIdx);
fprintf('word = %s , logP(word)=%f\n', [gn{seq}], mc.logprob(seq))

Here's a bunch of examples generated from the letters 'markovchains', along with log-probability of the word given the model:

word = mivorancask , logP(word)=-29.610819
word = arknoamshiv , logP(word)=-32.496090
word = ancoramshik , logP(word)=-29.299897
word = orchisankav , logP(word)=-29.987204
word = avinchasorm , logP(word)=-27.178507
word = aronchaskim , logP(word)=-25.651964

You can see that although none are correct words, they are still better than just a random sequence of letters. Obviously using only the previous character to generate the next one is not enough, still it can be easily extended to more sophisticated cases (N-gram).

The nice thing about such an approach is that its not restricted to one language, and can be adapted to any other by simply feeding it documents from your language of choice.

Amro