I'd hate to reinvent something that has already been written, so I'm wondering whether there is a ReadWord() function somewhere in the .NET Framework that extracts words from text delimited by white space and line breaks.

If not, do you have an implementation that you'd like to share?

string data = "Four score and seven years ago";
List<string> words = new List<string>();
WordReader reader = new WordReader(data);

while (true)
{
   string word = reader.ReadWord();
   if (string.IsNullOrEmpty(word)) break;
   //additional parsing logic goes here
   words.Add(word);
}
+5  A: 

Not that I'm aware of directly. If you don't mind getting them all in one go, you could use a regular expression:

Regex wordSplitter = new Regex(@"\W+");
string[] words = wordSplitter.Split(data);

If the input has leading or trailing whitespace, you'll get an empty string at the beginning or end of the array, but you can avoid that by calling Trim first.
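As a rough sketch of the regular expression approach with the Trim call included: note that \W+ splits on every non-word character, so punctuation such as an apostrophe also acts as a delimiter. The version below assumes only whitespace should separate words and therefore uses \s+ instead. The class and method names are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the .NET Framework.
public static class RegexWordSplitter
{
    public static string[] SplitWords(string text)
    {
        // Trim first so the result has no empty entry at either end.
        string trimmed = text.Trim();

        // Splitting an empty string yields a single empty element,
        // so return an empty array for blank input instead.
        if (trimmed.Length == 0)
        {
            return new string[0];
        }

        // Assumption: words are separated by runs of whitespace only.
        return Regex.Split(trimmed, @"\s+");
    }
}
```

Calling RegexWordSplitter.SplitWords(data) on the question's sample string should give the six words in order.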

A different option is to write a method which reads one word at a time from a TextReader. It could even be an extension method if you're using .NET 3.5. Sample implementation:

using System;
using System.IO;
using System.Text;

public static class Extensions
{
    public static string ReadWord(this TextReader reader)
    {
        StringBuilder builder = new StringBuilder();
        int c;

        // Skip any whitespace left over from previous reads
        while ((c = reader.Read()) != -1)
        {
            if (!char.IsWhiteSpace((char) c))
            {
                break;
            }
        }
        // Finished?
        if (c == -1)
        {
            return null;
        }

        builder.Append((char) c);
        while ((c = reader.Read()) != -1)
        {
            if (char.IsWhiteSpace((char) c))
            {
                break;
            }
            builder.Append((char) c);
        }
        return builder.ToString();
    }
}

public class Test
{
    static void Main()
    {
        // Give it a few challenges :)
        string data = @"Four score     and

seven years ago    ";

        using (TextReader reader = new StringReader(data))
        {
            string word;

            while ((word = reader.ReadWord()) != null)
            {
                Console.WriteLine("'{0}'", word);
            }
        }
    }
}

Output:

'Four'
'score'
'and'
'seven'
'years'
'ago'
Jon Skeet
+5  A: 

Not as such. However, you could use String.Split to split the string into an array of strings based on a delimiting character or string. You can also specify multiple characters or strings as delimiters.
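For illustration, a minimal sketch of the String.Split approach with several delimiter characters. The delimiter set (space, tab, carriage return, line feed) is an assumption about what counts as white space here, and the class and method names are hypothetical.

```csharp
using System;

// Hypothetical helper showing String.Split with multiple delimiters.
public static class StringSplitExample
{
    public static string[] SplitOnWhitespace(string text)
    {
        // Assumed delimiter set; extend it if other separators matter.
        char[] delimiters = { ' ', '\t', '\r', '\n' };

        // RemoveEmptyEntries drops the empty strings that would otherwise
        // appear between consecutive delimiters or at either end.
        return text.Split(delimiters, StringSplitOptions.RemoveEmptyEntries);
    }
}
```

Without StringSplitOptions.RemoveEmptyEntries, runs of spaces or blank lines would produce empty entries in the array.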

If you'd prefer to do it without loading everything into memory, you could write your own stream class that extracts words as it reads from a stream, but String.Split is a quick fix for splitting small amounts of data into words.
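One way to avoid holding every word in memory, sketched here as an iterator method rather than a full stream class, is to yield words lazily from a TextReader. This is only one possible shape for the idea; the WordStreaming class and ReadWords method are hypothetical names, and the reader is assumed to be opened and disposed by the caller.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper: yields words one at a time as they are read.
public static class WordStreaming
{
    public static IEnumerable<string> ReadWords(TextReader reader)
    {
        StringBuilder builder = new StringBuilder();
        int c;

        while ((c = reader.Read()) != -1)
        {
            if (char.IsWhiteSpace((char) c))
            {
                // Whitespace ends the current word, if there is one.
                if (builder.Length > 0)
                {
                    yield return builder.ToString();
                    builder.Length = 0; // reset the buffer for the next word
                }
            }
            else
            {
                builder.Append((char) c);
            }
        }

        // Emit the final word when the input does not end in whitespace.
        if (builder.Length > 0)
        {
            yield return builder.ToString();
        }
    }
}
```

Used with a StreamReader over a large file, a foreach loop over WordStreaming.ReadWords(reader) only ever buffers the word currently being built.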

RobG