tags:

views:

83

answers:

3

Good night everyone,

This question leaves me a little embarassed because, of couse, I know I should be able to get the answer alone. However, my knowledge about Python is just a little bit more than nothing, so I need help from someone more experienced with it than me...

The following code comes from Norvig's "Natural Language Corpus Data" chapter in a recently edited book, and it's about transforming a sentence "likethisone" into "[like, this, one]" (that means, segmenting the word correctly)...

I have ported all of the code to C# (in fact, re-wrote the program by myself) except for the function segment, which I am having a lot of trouble even trying to understand it's syntax. Can someone please help me translating it to a more readable form in C#?

Thank you very much in advance.

################ Word Segmentation (p. 223)

@memo
def segment(text):
    "Return a list of words that is the best segmentation of text."
    if not text: return []
    candidates = ([first]+segment(rem) for first,rem in splits(text))
    return max(candidates, key=Pwords)

def splits(text, L=20):
    "Return a list of all possible (first, rem) pairs, len(first)<=L."
    return [(text[:i+1], text[i+1:]) 
            for i in range(min(len(text), L))]

def Pwords(words): 
    "The Naive Bayes probability of a sequence of words."
    return product(Pw(w) for w in words)

#### Support functions (p. 224)

def product(nums):
    "Return the product of a sequence of numbers."
    return reduce(operator.mul, nums, 1)

class Pdist(dict):
    "A probability distribution estimated from counts in datafile."
    def __init__(self, data=[], N=None, missingfn=None):
        for key,count in data:
            self[key] = self.get(key, 0) + int(count)
        self.N = float(N or sum(self.itervalues()))
        self.missingfn = missingfn or (lambda k, N: 1./N)
    def __call__(self, key): 
        if key in self: return self[key]/self.N  
        else: return self.missingfn(key, self.N)

def datafile(name, sep='\t'):
    "Read key,value pairs from file."
    for line in file(name):
        yield line.split(sep)

def avoid_long_words(key, N):
    "Estimate the probability of an unknown word."
    return 10./(N * 10**len(key))

N = 1024908267229 ## Number of tokens

Pw  = Pdist(datafile('count_1w.txt'), N, avoid_long_words)
+2  A: 

Let's tackle the first function first:

def segment(text): 
    "Return a list of words that is the best segmentation of text." 
    if not text: return [] 
    candidates = ([first]+segment(rem) for first,rem in splits(text)) 
    return max(candidates, key=Pwords) 

It takes a word and returns the most likely list of words that it could be, so its signature will be static IEnumerable<string> segment(string text). Obviously if text is an empty string, its result should be an empty list. Otherwise, it creates a recursive list comprehension defining the possible candidate lists of words and returns the maximum based on its probability.

static IEnumerable<string> segment(string text)
{
    if (text == "") return new string[0]; // C# idiom for empty list of strings
    var candidates = from pair in splits(text)
                     select new[] {pair.Item1}.Concat(segment(pair.Item2));
    return candidates.OrderBy(Pwords).First();
}

Of course, now we have to translate the splits function. Its job is to return a list of all possible tuples of the beginning and end of a word. It's fairly straightforward to translate:

static IEnumerable<Tuple<string, string>> splits(string text, int L = 20)
{
    return from i in Enumerable.Range(1, Math.Min(text.Length, L))
           select Tuple.Create(text.Substring(0, i), text.Substring(i));
}

Next is Pwords, which just calls the product function on the result of Pw on each word in its input list:

static double Pwords(IEnumerable<string> words)
{
    return product(from w in words select Pw(w));
}

And product is pretty simple:

static double product(IEnumerable<double> nums)
{
    return nums.Aggregate((a, b) => a * b);
}

ADDENDUM:

Looking at the full source code, it is apparent that Norvig intends the results of the segment function to be memoized for speed. Here's a version that provides this speed-up:

static Dictionary<string, IEnumerable<string>> segmentTable =
   new Dictionary<string, IEnumerable<string>>();

static IEnumerable<string> segment(string text)
{
    if (text == "") return new string[0]; // C# idiom for empty list of strings
    if (!segmentTable.ContainsKey(text))
    {
        var candidates = from pair in splits(text)
                         select new[] {pair.Item1}.Concat(segment(pair.Item2));
        segmentTable[text] = candidates.OrderBy(Pwords).First().ToList();
    }
    return segmentTable[text];
}
Gabe
Thank you very much! It's a very nice piece of code Gabe. I just learned a lot of features of C# that were new for me with this.
Miguel
A: 

I don't know C# at all, but I can explain how the Python code works.

@memo
def segment(text):
    "Return a list of words that is the best segmentation of text."
    if not text: return []
    candidates = ([first]+segment(rem) for first,rem in splits(text))
    return max(candidates, key=Pwords)

The first line,

@memo

is a decorator. This causes the function, as defined in the subsequent lines, to be wrapped in another function. Decorators are commonly used to filter inputs and outputs. In this case, based on the name and the role of the function it's wrapping, I gather that this function memoizes calls to segment.

Next:

def segment(text):
    "Return a list of words that is the best segmentation of text."
    if not text: return []

Declares the function proper, gives a docstring, and sets the termination condition for this function's recursion.

Next is the most complicated line, and probably the one that gave you trouble:

    candidates = ([first]+segment(rem) for first,rem in splits(text))

The outer parentheses, combined with the for..in construct, create a generator expression. This is an efficient way of iterating over a sequence, in this case splits(text). Generator expressions are sort of a compact for-loop that yields values. In this case, the values become the elements of the iteration candidates. "Genexps" are similar to list comprehensions, but achieve greater memory efficiency by not retaining each value that they produce.

So for each value in the iteration returned by splits(text), a list is produced by the generator expression.

Each of the values from splits(text) is a (first, rem) pair.

Each produced list starts with the object first; this is expressed by putting first inside a list literal, i.e. [first]. Then another list is added to it; that second list is determined by a recursive call to segment. Adding lists in Python concatenates them, i.e. [1, 2] + [3, 4] gives [1, 2, 3, 4].

Finally, in

    return max(candidates, key=Pwords)

the recursively-determined list iteration and a key function are passed to max. The key function is called on each value in the iteration to get the value used to determine whether or not that list has the highest value in the iteration.

intuited
A: 

Thanks very much everyone. However, there is still a small problem. I am using Gabe's solution and when I try to compile it says something like

Error 1 Could not find an implementation of the query pattern for source type 'string[,]'. 'Select' not found. Are you missing a reference to 'System.Core.dll' or a using directive for 'System.Linq'?

Can someone please help me with this error? The directive using System.Linq is in the code.

Miguel
Miguel: You have a `string[,]` which is a 2-dimensional array of strings. Arrays with more than one dimension do not support LINQ. I don't know how you created it, but I've never seen one. You should either modify your question or ask a new one to get help with this.
Gabe