String processing in C# and VB.NET is easy for me, but understanding how to do the same in F# is not so easy. I am reading two Apress F# books (Foundations and Expert). Most of the samples are number crunching, with very little string manipulation. In particular, there are few samples of seq { sequence-expression } and Lists applied to text.
I have a C# program I want to convert to F#. Here is what it does:
- Open a txt file
- split file paragraphs, look for CRLF between paragraphs
- split paragraph lines, look for . ! ? between lines
- split line words, look for " " space between words
- output number of paragraphs, lines and words
- Loop over the collection of words, find and count all occurrences of a string within the collection, and mark the location of each word found.
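The splitting and counting steps above can be sketched in C# with LINQ instead of explicit loops. This is only a minimal sketch under stated assumptions, not the actual program: the names TextStats, SplitParagraphs, SplitSentences, SplitWords and Count are hypothetical, and the sentence rule here (break after . ! or ? followed by whitespace) is deliberately simpler than a production rule would be.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class TextStats
{
    // Paragraphs: non-empty chunks separated by CR LF (a bare LF is also accepted).
    public static string[] SplitParagraphs(string text) =>
        text.Split(new[] { "\r\n", "\n" }, StringSplitOptions.RemoveEmptyEntries);

    // Sentences: break after . ! or ? when followed by whitespace (simplified rule).
    public static string[] SplitSentences(string paragraph) =>
        Regex.Split(paragraph.Trim(), @"(?<=[.!?])\s+")
             .Where(s => s.Length > 0)
             .ToArray();

    // Words: non-empty chunks separated by spaces.
    public static string[] SplitWords(string sentence) =>
        sentence.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

    // Counts paragraphs, sentences and words in one pass over the pipeline.
    public static (int Paragraphs, int Sentences, int Words) Count(string text)
    {
        var paragraphs = SplitParagraphs(text);
        var sentences = paragraphs.SelectMany(p => SplitSentences(p)).ToArray();
        var words = sentences.SelectMany(s => SplitWords(s)).ToArray();
        return (paragraphs.Length, sentences.Length, words.Length);
    }
}
```

Each stage feeds the next through SelectMany, which is roughly the shape that Seq.collect would take in F#.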
Here is a simple example of what I can do in C#, but not yet in F#.
Suppose this is a text file:
Order, Supreme Court, New York County (Paul G Someone), entered March 18, 2008, which, in an action for personal injuries sustained in a trip and fall over a pothole allegedly created by the negligence of defendants City or Consolidated McPherson, and a third-party action by Consolidated McPherson against its contractor (Mallen), insofar as appealed from, denied, as untimely, Mallen's motion for summary judgment dismissing the complaint and third-party complaint, unanimously affirmed, without costs.
Parties are afforded great latitude in charting their procedural course through the courts, by stipulation or otherwise. Thus, we affirm the denial of Mallen's motion as untimely since Mallen offered no excuse for the late filing.
I get this output:
2 Paragraphs
3 Lines
109 Words
Found Tokens: 2
Token insofar: ocurrence(s) 1: position(s): 52
Token thus: ocurrence(s) 1: position(s): 91
Lines should have been called Sentences :(
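The token search behind output like this can be sketched as follows. It is an illustration under assumptions rather than the real code: TokenFinder, FindTokens and the Punctuation list are hypothetical names, words are split on spaces and line breaks, punctuation is trimmed from both ends of each word, matching is case-insensitive, and positions are 1-based word indexes.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper class, for illustration only.
public static class TokenFinder
{
    // Characters stripped from both ends of a word before comparing ("Thus," -> "Thus").
    private static readonly char[] Punctuation =
        { ',', '.', '!', '?', ';', ':', '(', ')', '"', '\'' };

    // Maps each token that occurs in the text to its 1-based word positions.
    public static Dictionary<string, List<int>> FindTokens(string text, IEnumerable<string> tokens)
    {
        var wanted = new HashSet<string>(tokens, StringComparer.OrdinalIgnoreCase);
        var words = text.Split(new[] { ' ', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);

        return words
            // Pair every cleaned word with its 1-based position in the text.
            .Select((word, index) => new { Word = word.Trim(Punctuation), Position = index + 1 })
            // Keep only the words that are in the token set.
            .Where(w => wanted.Contains(w.Word))
            // Collect the positions of each token, ignoring case.
            .GroupBy(w => w.Word, StringComparer.OrdinalIgnoreCase)
            .ToDictionary(g => g.Key,
                          g => g.Select(w => w.Position).ToList(),
                          StringComparer.OrdinalIgnoreCase);
    }
}
```

Because all tokens are looked up in a single HashSet, the text is scanned once for the whole token list instead of once per token. Under these assumptions, calling FindTokens on the sample text with the tokens "insofar" and "thus" should report positions 52 and 91, in line with the output shown above.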
There are several tokens, more than 100 I'd say, grouped by class. I have to iterate over the same text several times, trying to match each and every token. Here are portions of the code. They show how I split sentences and put them in a ListBox, which makes it easy to get the item count. This works for paragraphs, sentences and tokens. They also show how I am relying on for and foreach. It is this approach I want to avoid by using, if possible, seq { sequence-expression } and Lists, Seq.iter or List.iter, and whatever match token ... with expressions are necessary.
/// <summary>
/// split the text into sentences and display
/// the results in a list box
/// </summary>
private void btnParseText_Click(object sender, EventArgs e)
{
    lstLines.Items.Clear();
    ArrayList al = SplitLines(richTextBoxParagraphs.Text);
    for (int i = 0; i < al.Count; i++)
    {
        // populate the list box
        lstLines.Items.Add(al[i].ToString());
    }
}
/// <summary>
/// parse a body of text into sentences
/// </summary>
private ArrayList SplitLines(string sText)
{
    // array list to hold the sentences
    ArrayList al = new ArrayList();
    // split where sentence-ending punctuation is followed by
    // whitespace and then a capital letter
    string[] splitLines =
        Regex.Split(sText, @"(?<=['""A-Za-z0-9][\.\!\?])\s+(?=[A-Z])");
    // loop the sentences
    for (int i = 0; i < splitLines.Length; i++)
    {
        string sOneLine =
            splitLines[i].Replace(Environment.NewLine, string.Empty);
        al.Add(sOneLine.Trim());
    }
    // update statistics
    lblLineCount.Text = "Line Count: " +
        GetLineCount(splitLines).ToString();
    // words
    lblWordCount.Text = "Word Count: " +
        GetWordCount(al).ToString();
    // tokens
    lblTokenCount.Text = "Token Count: " +
        GetTokenCount(al).ToString();
    // return the arraylist
    return al;
}
/// <summary>
/// count of all words contained in the ArrayList
/// </summary>
public int GetWordCount(ArrayList allLines)
{
    // return value
    int rtn = 0;
    // a single space is the split char
    char[] arrSplitChars = {' '};
    // iterate through the list of sentences
    foreach (string sSentence in allLines)
    {
        // split the sentence into words, dropping empty entries
        string[] arrWords = sSentence.Split(arrSplitChars, StringSplitOptions.RemoveEmptyEntries);
        rtn += arrWords.Length;
    }
    // return word count
    return rtn;
}
In fact, it is a very simple Windows application: a form with one RichTextBox, three ListBoxes (paragraphs, lines, tokens found), labels to display the output, and one button.