ansaurus

Question

A little help needed in code translation (Python to C#)

Answer 1

+2 A:

Let's tackle the first function first:

def segment(text): 
    "Return a list of words that is the best segmentation of text." 
    if not text: return [] 
    candidates = ([first]+segment(rem) for first,rem in splits(text)) 
    return max(candidates, key=Pwords)

It takes a string of text and returns the most probable list of words it could be split into, so its signature will be static IEnumerable<string> segment(string text). If text is an empty string, the result is an empty list. Otherwise, it recursively builds the candidate word lists (each candidate is a first word followed by the best segmentation of the remainder) and returns the candidate with the highest probability.

static IEnumerable<string> segment(string text)
{
    if (text == "") return new string[0]; // C# idiom for empty list of strings
    var candidates = from pair in splits(text)
                     select new[] {pair.Item1}.Concat(segment(pair.Item2));
    return candidates.OrderByDescending(Pwords).First(); // highest probability first, like Python's max
}

Of course, now we have to translate the splits function. Its job is to return every way of dividing the text into a first word and a remainder, as (first, remainder) tuples, where the first word is at most L characters long. It's fairly straightforward to translate:

static IEnumerable<Tuple<string, string>> splits(string text, int L = 20)
{
    return from i in Enumerable.Range(1, Math.Min(text.Length, L))
           select Tuple.Create(text.Substring(0, i), text.Substring(i));
}
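
For comparison, here is a minimal Python sketch of what splits is expected to produce. It is written from the description above rather than copied from Norvig's source, and the default limit of 20 simply mirrors the C# signature.

```python
def splits(text, L=20):
    """Return all (first, rem) pairs where first is 1..L characters long."""
    return [(text[:i], text[i:]) for i in range(1, min(len(text), L) + 1)]

print(splits("abc"))  # [('a', 'bc'), ('ab', 'c'), ('abc', '')]
```

Note that the last pair has an empty remainder, which is what lets the recursion in segment bottom out.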

Next is Pwords, which just calls the product function on the result of Pw on each word in its input list:

static double Pwords(IEnumerable<string> words)
{
    return product(from w in words select Pw(w));
}

And product is pretty simple:

static double product(IEnumerable<double> nums)
{
    return nums.Aggregate(1.0, (a, b) => a * b); // seed of 1.0, so an empty sequence yields 1
}
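
To make the scoring concrete, here is a small Python sketch of the same two helpers. The Pw function and the TOY_PROBS table below are hypothetical stand-ins I made up for illustration; the real Pw comes from Norvig's word-frequency data and is not shown in this thread.

```python
from functools import reduce
import operator

# Hypothetical toy probabilities, for illustration only.
TOY_PROBS = {"a": 0.5, "cat": 0.1}

def Pw(word):
    """Toy stand-in for the real Pw: unknown words get a tiny probability."""
    return TOY_PROBS.get(word, 1e-6)

def product(nums):
    """Multiply all numbers together; the initial value 1 covers an empty input."""
    return reduce(operator.mul, nums, 1)

def Pwords(words):
    """Score a word list as the product of its per-word probabilities."""
    return product(Pw(w) for w in words)

print(Pwords(["a", "cat"]))  # roughly 0.05
print(product([]))           # 1
```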

ADDENDUM:

Looking at the full source code, it is apparent that Norvig intends the results of the segment function to be memoized for speed. Here's a version that provides this speed-up:

static Dictionary<string, IEnumerable<string>> segmentTable =
   new Dictionary<string, IEnumerable<string>>();

static IEnumerable<string> segment(string text)
{
    if (text == "") return new string[0]; // C# idiom for empty list of strings
    if (!segmentTable.ContainsKey(text))
    {
        var candidates = from pair in splits(text)
                         select new[] {pair.Item1}.Concat(segment(pair.Item2));
        segmentTable[text] = candidates.OrderByDescending(Pwords).First().ToList(); // best candidate, like Python's max
    }
    return segmentTable[text];
}

Gabe 2010-10-15 21:50:20

Thank you very much! It's a very nice piece of code Gabe. I just learned a lot of features of C# that were new for me with this.

Miguel 2010-10-15 22:00:50

Answer 2

A:

I don't know C# at all, but I can explain how the Python code works.

@memo
def segment(text):
    "Return a list of words that is the best segmentation of text."
    if not text: return []
    candidates = ([first]+segment(rem) for first,rem in splits(text))
    return max(candidates, key=Pwords)

The first line,

@memo

is a decorator. It causes the function defined in the subsequent lines to be wrapped in another function. Decorators are commonly used to filter inputs and outputs. In this case, based on the name and the role of the function it's wrapping, I gather that this decorator memoizes calls to segment, i.e. it caches each result so that repeated calls with the same text are not recomputed.
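
I haven't seen the actual definition of memo, so the following is only a guess at what such a decorator usually looks like. The square function and the calls list are hypothetical, there just to show that the wrapped function runs once per distinct argument.

```python
import functools

def memo(f):
    """Cache f's results, keyed by its positional arguments."""
    table = {}

    @functools.wraps(f)
    def fmemo(*args):
        if args not in table:
            table[args] = f(*args)  # first call: compute and remember
        return table[args]          # later calls: reuse the stored result

    return fmemo

calls = []

@memo
def square(n):
    calls.append(n)  # record each real invocation
    return n * n

print(square(4), square(4), len(calls))  # 16 16 1
```

The arguments must be hashable for this to work, which is fine for segment since it takes a single string.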

def segment(text):
    "Return a list of words that is the best segmentation of text."
    if not text: return []

This declares the function proper, gives it a docstring, and sets the termination condition for the recursion: an empty text yields an empty list.

Next is the most complicated line, and probably the one that gave you trouble:

    candidates = ([first]+segment(rem) for first,rem in splits(text))

The outer parentheses, combined with the for..in construct, create a generator expression. This is an efficient way of iterating over a sequence, in this case splits(text). A generator expression is a sort of compact for-loop that yields values one at a time. In this case, those values become the elements of the iterable candidates. "Genexps" are similar to list comprehensions, but they are more memory-efficient because they produce each value on demand instead of storing them all in a list.

So for each value in the iterable returned by splits(text), a list is produced by the generator expression.
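
A quick way to see the difference between the two forms is to log when values are actually produced. The noisy helper and the log list here are hypothetical names used only for this demonstration.

```python
log = []

def noisy(n):
    """Yield 0..n-1, recording each value in log as it is produced."""
    for i in range(n):
        log.append(i)
        yield i

eager = [x * x for x in noisy(3)]  # list comprehension: runs to completion now
print(eager, log)                  # [0, 1, 4] [0, 1, 2]

log.clear()
lazy = (x * x for x in noisy(3))   # generator expression: nothing produced yet
print(log)                         # []
print(next(lazy), log)             # 0 [0]
print(list(lazy))                  # [1, 4]
print(list(lazy))                  # [] (a generator can only be consumed once)
```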

Each of the values from splits(text) is a (first, rem) pair.

Each produced list starts with the object first; this is expressed by putting first inside a list literal, i.e. [first]. Then another list is added to it; that second list is determined by a recursive call to segment. Adding lists in Python concatenates them, i.e. [1, 2] + [3, 4] gives [1, 2, 3, 4].

Finally, in

    return max(candidates, key=Pwords)

the iterable of recursively built candidate lists and a key function are passed to max. The key function is called on each candidate list to get the score used for comparison, so max returns the candidate for which Pwords gives the highest value.
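
Putting the pieces together, here is a self-contained sketch that runs end to end. Everything about the probability model is an assumption on my part: TOY_PROBS and the length-based penalty for unknown words are made up, splits is written from its apparent behaviour, and functools.lru_cache stands in for the memo decorator.

```python
from functools import lru_cache, reduce
import operator

# Hypothetical toy word probabilities, for illustration only.
TOY_PROBS = {"choose": 0.02, "spain": 0.01, "chooses": 0.001, "pain": 0.01}

def Pw(word):
    """Toy model: unknown words are penalised more the longer they are."""
    return TOY_PROBS.get(word, 10.0 ** (-3 * len(word)))

def Pwords(words):
    """Product of the per-word probabilities."""
    return reduce(operator.mul, (Pw(w) for w in words), 1)

def splits(text, L=20):
    """All (first, rem) pairs with first between 1 and L characters long."""
    return [(text[:i], text[i:]) for i in range(1, min(len(text), L) + 1)]

@lru_cache(maxsize=None)  # plays the role of @memo
def segment(text):
    "Return a list of words that is the best segmentation of text."
    if not text:
        return []
    candidates = ([first] + segment(rem) for first, rem in splits(text))
    return max(candidates, key=Pwords)

print(max(["a", "bbb", "cc"], key=len))  # bbb
print(segment("choosespain"))            # ['choose', 'spain']
```

With these toy numbers, ['choose', 'spain'] scores 0.02 * 0.01 while ['chooses', 'pain'] scores 0.001 * 0.01, so max picks the former.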

intuited 2010-10-15 22:05:54

Answer 3

A:

Thanks very much everyone. However, there is still a small problem. I am using Gabe's solution and when I try to compile it says something like

Error 1 Could not find an implementation of the query pattern for source type 'string[,]'. 'Select' not found. Are you missing a reference to 'System.Core.dll' or a using directive for 'System.Linq'?

Can someone please help me with this error? The directive using System.Linq is in the code.

Miguel 2010-10-15 22:15:32

Miguel: You have a `string[,]`, which is a 2-dimensional array of strings. Arrays with more than one dimension do not support LINQ query operators directly, because they only implement the non-generic IEnumerable. I don't know how you created it, but I've never seen one. You should either modify your question or ask a new one to get help with this.

Gabe 2010-10-15 23:34:56
