String processing in C# and VB.NET is easy for me, but I find it much harder to understand how to do the same in F#. I am reading two Apress F# books (Foundations and Expert). Most samples are number crunching, with, I think, very little string manipulation. In particular, I find few string samples that use seq { sequence-expression } and Lists.

I have a C# program I want to convert to F#. Here is what it does:

  1. Open a txt file
  2. split file paragraphs, look for CRLF between paragraphs
  3. split paragraph lines, look for . ! ? between lines
  4. split line words, look for " " space between words
  5. output number of paragraphs, lines and words
  6. Loop over the collection of words, find and count all occurrences of a string within the collection, and mark the locations where the word is found.

Here is a simple example of what I can do in C#, but not yet in F#.

Suppose this is a text file:

Order, Supreme Court, New York County (Paul G Someone), entered March 18, 2008, which, in an action for personal injuries sustained in a trip and fall over a pothole allegedly created by the negligence of defendants City or Consolidated McPherson, and a third-party action by Consolidated McPherson against its contractor (Mallen), insofar as appealed from, denied, as untimely, Mallen's motion for summary judgment dismissing the complaint and third-party complaint, unanimously affirmed, without costs.

Parties are afforded great latitude in charting their procedural course through the courts, by stipulation or otherwise. Thus, we affirm the denial of Mallen's motion as untimely since Mallen offered no excuse for the late filing.

I get this output:

2 Paragraphs
3 Lines
109 Words

Found Tokens: 2
Token insofar: ocurrence(s) 1: position(s): 52
Token thus: ocurrence(s) 1: position(s): 91

Lines should have been called Sentences :(

There are several tokens, I'd say more than 100, grouped by class. I have to iterate over the same text several times, trying to match each and every token. Here are portions of the code. They show how I split sentences and put them in a ListBox, which makes it easy to get the item count. This works for paragraphs, sentences and tokens. They also show how I am relying on for and foreach. It is this approach I want to avoid by using, if possible, seq { sequence-expression } and Lists, Seq.iter or List.iter, and whatever match token ... with expressions are necessary.

    /// <summary>
    /// split the text into sentences and displays
    /// the results in a list box
    /// </summary>
    private void btnParseText_Click(object sender, EventArgs e)
    {
        lstLines.Items.Clear();

        ArrayList al = SplitLines(richTextBoxParagraphs.Text);
        for (int i = 0; i < al.Count; i++)
            //populate a list box
            lstLines.Items.Add(al[i].ToString());
    }


    /// <summary>
    /// parse a body of text into sentences 
    /// </summary>
    private ArrayList SplitLines(string sText)
    {

        // array list to hold the sentences
        ArrayList al = new ArrayList();

        // split the lines regexp
        string[] splitLines = 
            Regex.Split(sText, @"(?<=['""A-Za-z0-9][\.\!\?])\s+(?=[A-Z])");

        // loop the sentences
        for (int i = 0; i < splitLines.Length; i++)
        {
            string sOneLine =
                splitLines[i].Replace(Environment.NewLine, string.Empty);
            al.Add(sOneLine.Trim());
        }

        // update statistics
        lblLineCount.Text = "Line Count: " + 
            GetLineCount(splitLines).ToString();
        // words
        lblWordCount.Text = "Word Count: " + 
            GetWordCount(al).ToString();
        // tokens
        lblTokenCount.Text = "Token Count: " +
            GetTokenCount(al).ToString();

        // return the arraylist
        return al;
    }

    /// <summary>
    /// count of all words contained in the ArrayList 
    /// </summary>
    public int GetWordCount(ArrayList allLines)
    {
        // return value
        int rtn = 0;

        // iterate through list
        foreach (string sLine in allLines)
        {
            // empty space is the split char
            char[] arrSplitChars = {' '};

            // create a string array and populate
            string[] arrWords = sLine.Split(arrSplitChars, StringSplitOptions.RemoveEmptyEntries);
            rtn += arrWords.Length;
        }

        // return word count
        return rtn;
    }

In fact, it is a very simple Windows application: a form with one RichTextBox, three ListBoxes (paragraphs, lines, tokens found), labels to display output, and one button.

A: 

Could you post your C#-program? (Edit your question)

I think you can implement this in a very similar way in F# unless your original code is heavily based on changing variables (for which I don't see reasons in your problem description).

In case you used String.Split in C#: It's basically the same thing:

open System
let results = "Hello World".Split [|' '|]
let results2 = "Hello, World".Split ([| ", "|], StringSplitOptions.None)

In order to concatenate the resulting sequences, you can combine yield and yield!.

Abstract example

let list = [ yield! [1..8]; for i in 3..10 do yield i * i ]
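
The same list-expression syntax works for strings. Here is a minimal sketch, assuming paragraphs are separated by a blank line and words by single spaces. The `sample` text and the names `paragraphs` and `allWords` are made up for illustration and are not taken from the question's code:

```fsharp
open System

// Hypothetical sample text: two paragraphs separated by a blank line.
let sample = "First paragraph here.\n\nSecond one. It has two sentences."

// Split on blank lines to get the paragraphs.
let paragraphs =
    sample.Split([| "\n\n" |], StringSplitOptions.RemoveEmptyEntries)

// Flatten the paragraphs into one list of words.
let allWords =
    [ for paragraph in paragraphs do
        // yield! splices every word of this paragraph into the result list
        yield! paragraph.Split([| ' ' |], StringSplitOptions.RemoveEmptyEntries) ]

printfn "%d paragraphs, %d words" paragraphs.Length allWords.Length
```

For a file with Windows line endings the paragraph separator would probably need to be "\r\n\r\n" instead; that detail is an assumption about the input.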
Dario
Well, yes, Split is the same. I also use regular expressions. What I do not know is how to put paragraphs, lines and words into sequences. I read samples of List.iter and Seq.iter and I get it for numbers, but not for strings. In C# I am putting everything in an ArrayList and then relying too much on foreach paragraph in paragraphs...foreach line in lines...foreach word in words..., and then there is the collection of tokens. There must be simpler ways to do that in F# using sequences or lists, without mimicking the imperative style of my current solution.
dde
You can express this in F#'s list generator syntax. Edited my post
Dario
+1  A: 

You should post your C# code in the question (it sounds a bit like homework; people will have more faith if you demonstrate that you've already made the effort in one language and are really trying to learn more about another).

There isn't necessarily much F#-specific here, you can do this pretty similarly in any .Net language. There are a number of strategies you can use, for example below I use regular expressions for lexing out the words... only a couple F# idioms below, though.

open System
open System.Text.RegularExpressions 

let text = @"Order, Supreme Court, New York County (Paul G Someone), entered 
March 18, 2008, which, in an action for personal injuries sustained in a 
trip and fall over a pothole allegedly created by the negligence of 
defendants City or Consolidated McPherson, and a third-party action by 
Consolidated McPherson against its contractor (Mallen), insofar as appealed 
from, denied, as untimely, Mallen's motion for summary judgment dismissing 
the complaint and third-party complaint, unanimously affirmed, without costs.

Parties are afforded great latitude in charting their procedural course 
through the courts, by stipulation or otherwise. Thus, we affirm the denial 
of Mallen's motion as untimely since Mallen offered no excuse for the late 
filing."

let lines = text.Split([|'\n'|])
// If was in file, could use
//let lines = System.IO.File.ReadAllLines(@"c:\path\filename.txt")
// just like C#.  For this example, assume have giant string above

let fullText = String.Join(" ", lines)
let numParagraphs = 
    let mutable count = 1
    for line in lines do
        // look for blank lines, assume each delimits another paragraph
        if Regex.IsMatch(line, @"^\s*$") then
            count <- count + 1
    count
let numSentences =     
    // count sentence-ending characters; the text ends with one, so start at 0
    let mutable count = 0
    for c in fullText do
        if c = '.' || c = '!' || c = '?' then
            count <- count + 1
    count
let words =
    let wordRegex = new Regex(@"\b(\w+)\b")
    let fullText = String.Join(" ", lines)
    [| for m in wordRegex.Matches(fullText) do
        yield m.Value |]
printfn "%d paragraphs" numParagraphs
printfn "%d sentences" numSentences
printfn "%d words" words.Length
let Find token =
    words |> Seq.iteri (fun n word ->
        if 0=String.Compare(word, token, 
                            StringComparison.OrdinalIgnoreCase) then
            printfn "Found %s at word %d" word n
    )
let tokensToFind = ["insofar"; "thus"; "the"]
for token in tokensToFind do
    Find token
Brian
AAAAAAArh -- Unnecessary mutable values! To be functional, I'd replace them with a recursive Count function.
Dario
It works, Dario, it works. And consider what I wrote: "Most samples are number crunching and, I think, very little of string manipulation". There are so few examples of string manipulation using seq and list that it seems like it is all about numbers. I get frustrated because most of my development work is string manipulation: syntactic analysis, semantic analysis... and again there are few samples. I am new to F#, and the book chapters on lexing and parsing are just too much for now. Amazing, Brian (both you and how much less code it takes in F#). Big thanks.
dde
+3  A: 

Brian has a good start, but functional code will focus more on "what" you're trying to do than "how".

We can start out in a similar way:

open System
open System.Text.RegularExpressions 

let text = @"Order, Supreme Court, New York County (Paul G Someone), entered..."

let lines = text.Split([|Environment.NewLine|], StringSplitOptions.None)

First, let's look at paragraphs. I like Brian's approach to count blank lines separating paragraphs. So we filter to find only blank lines, count them, then return our paragraph count based on that value:

let numParagraphs = 
    let blankLines = lines |> Seq.filter (fun line -> Regex.IsMatch(line, @"^\s*$"))
                           |> Seq.length
    blankLines + 1

For sentences, we can view the full text as a sequence of characters and count the number of sentence-ending characters. Because it's F#, let's use pattern matching:

let numSentences =
    let isSentenceEndChar c = match c with
                              | '.' | '!' | '?' -> true
                              | _ -> false
    text |> Seq.filter isSentenceEndChar
         |> Seq.length

Matching words can be as easy as a simple regular expression, but could vary with how you want to handle punctuation:

let words = Regex.Split(text, @"\s+")
let numWords = words.Length

numParagraphs |> printfn "%d paragraphs" 
numSentences  |> printfn "%d sentences"
numWords      |> printfn "%d words"

Finally, we define a function to print token occurrences, which is easily applied to a list of tokens:

let findToken token =
    let tokenMatch (word : string) = word.Equals(token, StringComparison.OrdinalIgnoreCase)
    words |> Seq.iteri (fun n word ->
        if tokenMatch word then
            printfn "Found %s at word %d" word n
    )

let tokensToFind = ["insofar"; "thus"; "the"]
tokensToFind |> Seq.iter findToken

Note that this code does not find "thus" because of its trailing comma. You will likely want to adjust either how words is generated or tokenMatch is defined.
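
One possible adjustment, as a rough sketch rather than a definitive fix: take every run of word characters as a word, so that "Thus," becomes "Thus", and return the matching positions instead of printing inside the loop. The names `sampleText`, `cleanWords` and `positionsOf` are hypothetical, and the short sample string stands in for `text`:

```fsharp
open System
open System.Text.RegularExpressions

// Hypothetical sample; in the code above this would be `text`.
let sampleText = "Thus, we affirm the denial. Thus it stands, insofar as appealed."

// Every run of word characters is a word, so punctuation is dropped.
let cleanWords =
    Regex.Matches(sampleText, @"\w+")
    |> Seq.cast<Match>
    |> Seq.map (fun m -> m.Value)
    |> List.ofSeq

// Zero-based positions of every case-insensitive match of the token.
let positionsOf (token : string) =
    cleanWords
    |> List.mapi (fun n word -> n, word)
    |> List.filter (fun (_, word) -> word.Equals(token, StringComparison.OrdinalIgnoreCase))
    |> List.map fst

for token in ["insofar"; "thus"] do
    let positions = positionsOf token
    printfn "Token %s: occurrence(s) %d: position(s): %A" token positions.Length positions
```

Be aware that `\w+` also splits "Mallen's" into "Mallen" and "s", which changes the word count, so whether this pattern is appropriate depends on how you want to treat apostrophes and hyphens.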

dahlbyk