Hey,

I am learning LINQ, and I want to read a text file (let's say an e-book) word by word using LINQ.

This is what I could come up with:

static void Main()
        {
            string[] content = File.ReadAllLines("text.txt");

            // Select each line (the range variable c), not the whole array.
            var query = from c in content
                        select c;

            foreach (var line in query)
            {
                Console.WriteLine(line);
            }
        }

This reads the file line by line. If I change ReadAllLines to ReadAllText, the file is read letter by letter.

Any ideas?

+2  A: 
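string[] content = File.ReadAllLines("text.txt");
// Split each line on spaces and flatten the results into one sequence of words.
var words = content.SelectMany(line => line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries));
foreach (string word in words)
{
    // process each word here
}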

Add whatever other whitespace characters you need as separators. Using StringSplitOptions.RemoveEmptyEntries to deal with consecutive whitespace is cleaner than the Where clause I originally used.

In .NET 4 you can use File.ReadLines for lazy evaluation and thus lower RAM usage when working on large files.

CodeInChaos
The problem with this is that words in the next line are appended without a space to the last word of the previous line.
Deepak
Why is that a problem? The ReadAllLines function should already split these apart. And then the SelectMany splits each line even further. And the Where clause deals with consecutive whitespaces.
CodeInChaos
my bad. Sorry :D That works absolutely fine.
Deepak
I think I'd prefer to split on `new Regex(@"[^\w'-]")` to catch most non-word chars but keep ' and - within words intact. If you aren't in .NET 4, you can also write your own lazy-evaluated ReadLines from a TextReader as `for(string line = rdr.ReadLine(); line != null; line = rdr.ReadLine())yield return line;`
Jon Hanna
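A rough sketch of how the two ideas in the comment above might fit together: a hand-written lazy line reader plus a regex split that keeps ' and - inside words. The names WordReader, ReadLines and Words are made up for illustration, and the pattern uses `+` so that runs of separators collapse; treat it as a starting point rather than a definitive tokenizer.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper class, not part of any library.
static class WordReader
{
    // Matches runs of characters that are NOT word characters, apostrophes or hyphens.
    private static readonly Regex Separator = new Regex(@"[^\w'-]+");

    // Lazily yields one line at a time, so the whole file is never held in memory.
    public static IEnumerable<string> ReadLines(TextReader rdr)
    {
        for (string line = rdr.ReadLine(); line != null; line = rdr.ReadLine())
            yield return line;
    }

    // Splits every line into words; Regex.Split can return empty strings
    // at the start or end of a line, so those are filtered out.
    public static IEnumerable<string> Words(TextReader rdr)
    {
        return ReadLines(rdr)
            .SelectMany(line => Separator.Split(line))
            .Where(word => word.Length > 0);
    }
}
```

With a file on disk this could be called as `WordReader.Words(File.OpenText("text.txt"))`, enumerating the result before the reader is disposed.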
A: 

You could write content.ToList().ForEach(p => p.Split(' ').ToList().ForEach(Console.WriteLine)) but that's not a lot of LINQ.

Noel Abrahams
ForEach has a return type of void, so this method chaining won't compile.
BleuM937
@BleuM937, actually it will.
Noel Abrahams
sorry, misread your parens.
BleuM937
A: 

Once you've decided what counts as a word (whitespace handling, compound words...), you'll be able to split each line into an IEnumerable of words on which you can execute any LINQ method.

vc 74
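To make that concrete, here is a minimal sketch, assuming a word is simply any run of characters between spaces or tabs. WordQueries, ToWords and TopWords are hypothetical names; the frequency query is just one example of a LINQ method chain you could run once you have the sequence of words.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper class for illustration only.
static class WordQueries
{
    // Assumption: words are separated by spaces or tabs; punctuation stays attached.
    public static IEnumerable<string> ToWords(IEnumerable<string> lines)
    {
        return lines.SelectMany(line =>
            line.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries));
    }

    // Example query: the most frequent words, compared case-insensitively.
    public static List<KeyValuePair<string, int>> TopWords(IEnumerable<string> lines, int count)
    {
        return ToWords(lines)
            .GroupBy(w => w, StringComparer.OrdinalIgnoreCase)
            .OrderByDescending(g => g.Count())
            .Take(count)
            .Select(g => new KeyValuePair<string, int>(g.Key, g.Count()))
            .ToList();
    }
}
```

Because ToWords takes any IEnumerable of lines, it should work the same with File.ReadAllLines or the lazy File.ReadLines.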
+1  A: 
string str = File.ReadAllText("text.txt");
char[] separators = { '\r', '\n', ',', '.', ' ', '"', '\t' };    // add your own
var words = str.Split(separators, StringSplitOptions.RemoveEmptyEntries);
Grozz
StringSplitOptions.RemoveEmptyEntries was new to me. Thanks.
CodeInChaos
A: 
string content = File.ReadAllText("Text.txt");

var words = from word in content.Split(WhiteSpace, StringSplitOptions.RemoveEmptyEntries)
            select word;

You will need to define the array of whitespace chars with your own values like so:

char[] WhiteSpace = { '\r', '\n', ' ', '\t' };

This code assumes that punctuation (like a comma) is part of the word.

Neowizard
A: 

It's probably better to read all the text using ReadAllText() and then use regular expressions to get the words. Using the space character as a delimiter can cause trouble, as the words will also include punctuation (commas, periods, etc.). For example:

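string text = File.ReadAllText("text.txt"); // file name taken from the question
Regex re = new Regex("[a-zA-Z0-9_-]+", RegexOptions.Compiled); // You'll need to change the RE to fit your needs
Match m = re.Match(text);
while (m.Success)
{
    // The pattern has no capture groups, so use the whole match.
    string word = m.Value;

    // do your processing here

    m = m.NextMatch();
}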
Waleed Eissa
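Since the question asks for LINQ, the same regex idea can probably be expressed as a query over Regex.Matches. This is only a sketch: RegexWords and Words are illustrative names, and the pattern is the one from the answer above, which you would still need to adapt.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper class for illustration only.
static class RegexWords
{
    // MatchCollection is non-generic on older frameworks, so Cast<Match>()
    // is needed before the usual LINQ operators can be applied.
    public static IEnumerable<string> Words(string text)
    {
        return Regex.Matches(text, "[a-zA-Z0-9_-]+")
            .Cast<Match>()
            .Select(m => m.Value);
    }
}
```

For example, `RegexWords.Words(File.ReadAllText("text.txt"))` would give a sequence of words without the surrounding punctuation.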