So what you're saying is you want to go from:
full text -> sequence of words -> rest of query
to
sequence of lines -> sequence of words -> rest of query
yes?
That seems straightforward:
    var words = from line in GetLines()
                from word in line.Split(' ')
                select word;
and then
words.Where( ... blah blah blah
Or, if you prefer using the "fluent" style throughout, the SelectMany() method is the one you want.
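For example, the whole thing in the fluent style might look like this (a sketch; the sample lines and the GetLines method are stand-ins for however you actually read the file, e.g. File.ReadLines):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class Demo
{
    // Stand-in for reading the file line by line; in real code this
    // might simply wrap File.ReadLines(path).
    static IEnumerable<string> GetLines()
    {
        yield return "foo bar";
        yield return "foo blah";
    }

    static void Main()
    {
        // SelectMany flattens the sequence-of-sequences, just like the
        // two "from" clauses in the query expression do.
        IEnumerable<string> words = GetLines()
            .SelectMany(line => line.Split(' '));

        Console.WriteLine(string.Join(",", words)); // foo,bar,foo,blah
    }
}
```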
I personally would not do this all in one go. I'd make the query, and then write a foreach loop. That way, the query is built free of side effects, and the side effects are in the loop, where they belong. But some people seem to prefer putting their side effects into a ForEach method instead.
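That separation might look like this (a sketch with made-up sample data; the query shape is an assumption about what "rest of query" contains):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class Demo
{
    // Stand-in for reading the file line by line.
    static IEnumerable<string> GetLines()
    {
        yield return "foo bar foo blah";
    }

    static void Main()
    {
        // The query is built here, free of side effects...
        var topWords =
            (from line in GetLines()
             from word in line.Split(' ')
             group word by word into g
             orderby g.Count() descending
             select new { Word = g.Key, Count = g.Count() }).Take(10);

        // ...and the side effects go in the loop, where they belong.
        foreach (var entry in topWords)
            Console.WriteLine("{0}: {1}", entry.Word, entry.Count);
    }
}
```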
UPDATE: There's a question as to how "lazy" this query is.
You are correct in that what you end up with is an in-memory representation of every word in the file; however, with my minor reorganization of it, you at least do not have to create one big string that contains the entire text to begin with; you can do it line by line.
There are ways to cut down on how much duplication there is here, which we'll come to in a minute. However, I want to keep talking for a bit about how to reason about laziness.
A great way to think about these things is due to Jon Skeet; I shall shamelessly steal it from him.
Imagine a stage upon which there is a line of people. They are wearing shirts that say GetLines, Split, Where, GroupBy, OrderByDescending, Take, ToList and ForEach.
ToList pokes Take. Take does something and then hands ToList a card with a list of words on it. ToList keeps on poking Take until Take says "I'm done". At that point, ToList makes a list out of all the cards it has been handed, and then hands the first one off to ForEach. The next time it is poked, it hands out the next card.
What does Take do? Every time it is poked it asks OrderByDescending for another card, and immediately hands that card to ToList. After handing out ten cards, it tells ToList "I'm done".
What does OrderByDescending do? When it is poked for the first time, it pokes GroupBy. GroupBy hands it a card. It keeps on poking GroupBy until GroupBy says "I'm done". Then OrderByDescending sorts the cards, and hands the first one to Take. Every subsequent time it is poked, it hands a new card to Take, until Take stops asking.
And so on. You see how this goes. The query operators GetLines, Split, Where, GroupBy, OrderByDescending and Take are lazy, in that they do not act until poked. Some of them (OrderByDescending, ToList, GroupBy) need to poke their card provider many, many times before they can respond to the guy poking them. Some of them (GetLines, Split, Where, Take) only poke their provider once when they are themselves poked.
Once ToList is done, ForEach pokes ToList. ToList hands ForEach a card off its list. ForEach counts the words, and then writes a word and a count on the whiteboard. ForEach keeps on poking ToList until ToList says "no more".
(Notice that the ToList is completely unnecessary in your query; all it does is accumulate the results of the top ten into a list. ForEach could be talking directly to Take.)
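You can watch the poking happen yourself by putting a trace in the source sequence (a small sketch with made-up data, not code from your question):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class Demo
{
    static IEnumerable<string> GetLines()
    {
        foreach (var line in new[] { "foo bar", "foo blah" })
        {
            // This only runs when someone actually pokes GetLines.
            Console.WriteLine("producing: " + line);
            yield return line;
        }
    }

    static void Main()
    {
        var words = GetLines().SelectMany(l => l.Split(' '));
        Console.WriteLine("query built; nothing produced yet");

        // Only now does the chain of pokes reach GetLines, and Take(2)
        // stops poking before the second line is ever produced.
        foreach (var w in words.Take(2))
            Console.WriteLine("got: " + w);
        // Output:
        //   query built; nothing produced yet
        //   producing: foo bar
        //   got: foo
        //   got: bar
    }
}
```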
Now, as for your question of whether you can reduce the memory footprint further: yes, you can. Suppose the file is "foo bar foo blah". Your code builds up the set of groups:
    {
        { key: foo, contents: { foo, foo } },
        { key: bar, contents: { bar } },
        { key: blah, contents: { blah } }
    }
and then orders those by the length of the contents list, and then takes the top ten. You don't have to store nearly that much in the contents list in order to compute the answer you want. What you really want to be storing is:
    {
        { key: foo, value: 2 },
        { key: bar, value: 1 },
        { key: blah, value: 1 }
    }
and then sort that by value.
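A sketch of that shape, counting into a dictionary rather than letting GroupBy buffer the contents lists (the sample text is made up):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class Demo
{
    static void Main()
    {
        var words = "foo bar foo blah".Split(' ');

        // Accumulate { word -> count } directly; no per-word contents lists.
        var counts = new Dictionary<string, int>();
        foreach (var word in words)
            counts[word] = counts.TryGetValue(word, out var n) ? n + 1 : 1;

        // Then sort the pairs by value and take the top ten.
        foreach (var pair in counts.OrderByDescending(p => p.Value).Take(10))
            Console.WriteLine("{0}: {1}", pair.Key, pair.Value);
    }
}
```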
Or, alternatively, you could build up the backwards mapping
    {
        { key: 2, value: { foo } },
        { key: 1, value: { bar, blah } }
    }
sort that by key, and then do a select-many on the lists until you have extracted the top ten words.
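The backwards mapping can be sketched as a second grouping, count-first, over made-up sample data:

```csharp
using System;
using System.Linq;

class Demo
{
    static void Main()
    {
        var words = "foo bar foo blah".Split(' ');

        // First group words, then group the words by their counts,
        // giving { 2 -> { foo }, 1 -> { bar, blah } }.
        var byCount = words
            .GroupBy(w => w)
            .GroupBy(g => g.Count(), g => g.Key)
            .OrderByDescending(g => g.Key);

        // Then select-many across the lists until ten words are extracted.
        foreach (var word in byCount.SelectMany(g => g).Take(10))
            Console.WriteLine(word);
    }
}
```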
The concept you want to look at in order to do either of these is the "accumulator". An accumulator is an object which efficiently "accumulates" information about a data structure while the data structure is being iterated over. "Sum" is an accumulator of a sequence of numbers. "StringBuilder" is often used as an accumulator on a sequence of strings. You could write an accumulator which accumulates counts of words as the list of words is walked over.
The function you want to study in order to understand how to do this is Aggregate:
http://msdn.microsoft.com/en-us/library/system.linq.enumerable.aggregate.aspx
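A word-count accumulator written with Aggregate might look like this sketch (sample data made up; the dictionary is the accumulator, threaded through every step of the walk):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class Demo
{
    static void Main()
    {
        var words = "foo bar foo blah".Split(' ');

        // Aggregate takes a seed (the empty dictionary) and a function
        // that folds each word into the accumulated state.
        var counts = words.Aggregate(
            new Dictionary<string, int>(),
            (acc, word) =>
            {
                acc[word] = acc.TryGetValue(word, out var n) ? n + 1 : 1;
                return acc;
            });

        Console.WriteLine(counts["foo"]); // 2
    }
}
```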
Good luck!