I've read the papers linked to in this question. I half get it. Can someone help me figure out how to implement it?

I assume that features are generated by some heuristic. Using a POS tagger as an example: maybe looking at the training data shows that 'bird' is tagged NOUN in all cases, so the feature f1(z_(n-1), z_n, X, n) is generated as

(if x_n = 'bird' and z_n = NOUN then 1 else 0)

Where X is the input vector and Z is the output vector. During weight training, we find that this f1 is never violated, so the corresponding weight \1 (\ for lambda) would end up positive and relatively large. Both guessing features and training seem challenging to implement, but otherwise straightforward.

I'm lost on how one applies the model to untagged data. Do I initialize the output vector with some arbitrary labels, and then change labels wherever doing so increases the sum over all the \ * f?

Any help on this would be greatly appreciated.

+2  A: 

I am not completely sure I understand you correctly, but yes: on the output side each vector is augmented with a start and an end symbol.

You are also right about feature functions being generated by some heuristic. Usually the heuristic is to take all possible combinations. In your example there would be a feature function for each (word, tag) pair, resulting in a large number of features. A common way to formulate such features is through the use of a feature template.
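
As a concrete illustration, here is a minimal Python sketch of that template heuristic; the function names are mine and not from any particular CRF library. It emits one indicator feature per (word, tag) combination, which is exactly the "all possible combinations" idea:

    # Illustrative sketch: one indicator feature per (word, tag) pair,
    # i.e. the "all possible combinations" template described above.
    def make_word_tag_feature(word, tag):
        def f(z_prev, z_cur, x, n):
            # 1 if the n-th input word is `word` and its proposed tag is `tag`
            return 1 if x[n] == word and z_cur == tag else 0
        return f

    def generate_features(vocabulary, tagset):
        # A realistic tagger would add transition templates over (z_prev, z_cur)
        # as well; this shows only the word/tag emission template.
        return [make_word_tag_feature(w, t) for w in vocabulary for t in tagset]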

When evaluating the model you don't care about normalization, so you are looking for the sequence that gives you the largest numerator. Usually the Viterbi algorithm is used to do so, except for very large label sets (in your example, a large number of possible tags), in which case approximations are used.
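
To make "largest numerator" concrete: the unnormalized log-score of a candidate labelling is just the weighted sum of all feature values along the sequence, and because exp is monotone you can compare those raw sums directly. A minimal sketch, reusing the f(z_prev, z_cur, x, n) signature from the question (the names here are my own illustration):

    # Sketch: unnormalized log-score of a candidate tag sequence z for input x.
    # `features` is a list of functions f(z_prev, z_cur, x, n); `weights` aligns with it.
    def sequence_score(weights, features, x, z, start="<S>"):
        total = 0.0
        z_prev = start
        for n in range(len(x)):
            total += sum(w * f(z_prev, z[n], x, n)
                         for w, f in zip(weights, features))
            z_prev = z[n]
        return total  # decoding = argmax of this over all z; Viterbi (below) finds it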

Viterbi on CRFs works much like it does with HMMs. You start at the beginning of your sequence and compute, for each possible label of the first word, the maximum unnormalized probability over all predecessors; since the only predecessor there is the START symbol, that maximum is trivial. In the next step you iterate over all labels that are possible for the second element of your prediction, i.e. z_2. The maximum of the unnormalized probability for each of those labels is computed from the values at the predecessor nodes (the values you computed in the first step) together with your model: you combine the potential of the predecessor, the transition to the node in question, and the node itself, and take the maximum over all predecessors. You then repeat this for every remaining position.

And yes, since the feature functions do not limit the dependence on the input side, you may take any information from it.

When you arrive at the end, you walk back through the recorded predecessors to determine how the maximum was reached.
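
A minimal Viterbi sketch under the same assumptions (the names and the score function are hypothetical, not from any specific implementation): score(z_prev, z_cur, x, n) stands for the weighted feature sum at position n, best[n][z] holds the highest-scoring prefix ending in label z, and back-pointers record how each maximum was reached so the path can be recovered at the end:

    # Sketch of Viterbi decoding for a linear-chain CRF.
    # score(z_prev, z_cur, x, n) should return sum_j lambda_j * f_j(z_prev, z_cur, x, n).
    def viterbi(x, labels, score, start="<S>"):
        # best[n][z]: highest unnormalized log-score of any labelling of x[0..n]
        # ending in label z; back[n][z]: the predecessor label achieving it.
        best = [{z: score(start, z, x, 0) for z in labels}]
        back = [{z: start for z in labels}]
        for n in range(1, len(x)):
            best.append({})
            back.append({})
            for z in labels:
                pred = max(labels, key=lambda zp: best[n - 1][zp] + score(zp, z, x, n))
                best[n][z] = best[n - 1][pred] + score(pred, z, x, n)
                back[n][z] = pred
        # Walk back from the best final label to recover the full sequence.
        z_last = max(labels, key=lambda z: best[-1][z])
        path = [z_last]
        for n in range(len(x) - 1, 0, -1):
            path.append(back[n][path[-1]])
        return list(reversed(path))

The runtime is O(len(x) * |labels|^2), which is why very large tag sets push you toward the approximate methods mentioned above.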

For further reading I recommend the report by Rahul Gupta.

muckl
Thank you, I found this answer very informative. For future reference, are there well-known ways to do approximations? I've only seen Forward/Backward mentioned along with Viterbi.
Loyal Tingley
+1  A: 

For linear CRFs representing simple sequences of words, you can use the Viterbi algorithm (as muckl has already mentioned).

For other topologies, you'll need a different algorithm.

Exact inference is intractable for many topologies, so an approximate inference algorithm, such as tree-based reparameterization, is used instead.

Ken Bloom
I've made a note to look into tree-based reparameterization, thanks!
Loyal Tingley