This little program finds the ten most-used words in a file. How would you, or could you, optimize it to process the file via line-by-line streaming, while keeping the functional style it has now?

    // Requires: using System; using System.IO; using System.Linq;
    static void Main(string[] args)
    {
        string path = @"C:\tools\copying.txt";

        File.ReadAllText(path)                    // loads the entire file into memory at once
            .Split(' ')
            .Where(s => !string.IsNullOrEmpty(s))
            .GroupBy(s => s)
            .OrderByDescending(g => g.Count())
            .Take(10)
            .ToList()
            .ForEach(g => Console.WriteLine("{0}\t{1}", g.Key, g.Count()));

        Console.ReadLine();
    }

Here is the line reader I'd like to use:

    // Extension methods must live in a static class for this to compile.
    static class LineReaderExtensions
    {
        public static IEnumerable<string> ReadLinesFromFile(this string filename)
        {
            using (StreamReader reader = new StreamReader(filename))
            {
                // Yield one line at a time; the whole file is never in memory at once.
                string s;
                while ((s = reader.ReadLine()) != null)
                    yield return s;
            }
        }
    }

Edit:

I realize that the implementation of top-words doesn't take into account punctuation and all the other little nuances, and I'm not too worried about that.

Clarification:

I'm interested in a solution that doesn't load the entire file into memory at once. I suppose you'd need a data structure that could take a stream of words and "group" on the fly -- like a trie. And then somehow get it done in a lazy way so the line reader can go about its business line by line. I'm now realizing that this is a lot to ask for and is a lot more complex than the simple example I gave above. Maybe I'll give it a shot and see if I can get the code as clear as above (with a bunch of new lib support).
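Something like this is roughly what I have in mind for the "group on the fly" part -- just a sketch, with a plain dictionary standing in for the trie, though it loses the declarative style:

    var counts = new Dictionary<string, int>();

    foreach (string line in path.ReadLinesFromFile())
    {
        foreach (string word in line.Split(' '))
        {
            if (string.IsNullOrEmpty(word))
                continue;

            int n;
            counts.TryGetValue(word, out n);    // n is 0 if the word is new
            counts[word] = n + 1;               // one counter per distinct word
        }
    }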

+4  A: 

So what you're saying is you want to go from:

full text -> sequence of words -> rest of query

to

sequence of lines -> sequence of words -> rest of query

yes?

That seems straightforward.

    var words = from line in GetLines()
                from word in line.Split(' ')
                select word;

and then

    words.Where( ... blah blah blah

Or, if you prefer using the "fluent" style throughout, the SelectMany() method is the one you want.

I personally would not do this all in one go. I'd make the query, and then write a foreach loop. That way, the query is built free of side effects, and the side effects are in the loop, where they belong. But some people seem to prefer putting their side effects into a ForEach method instead.
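For instance, here's a sketch of the fluent version, with the query built separately and the side effects kept in a loop (assuming lines is a lazy line sequence, such as the one your ReadLinesFromFile above produces):

    var topTen = lines
        .SelectMany(line => line.Split(' '))      // flatten lines into one word sequence
        .Where(w => !string.IsNullOrEmpty(w))
        .GroupBy(w => w)
        .OrderByDescending(g => g.Count())
        .Take(10);

    foreach (var g in topTen)                     // side effects belong here, in the loop
        Console.WriteLine("{0}\t{1}", g.Key, g.Count());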

UPDATE: There's a question as to how "lazy" this query is.

You are correct in that what you end up with is an in-memory representation of every word in the file; however, with my minor reorganization of it, you at least do not have to create one big string that contains the entire text to begin with; you can do it line by line.

There are ways to cut down on how much duplication there is here, which we'll come to in a minute. However, I want to keep talking for a bit about how to reason about laziness.

A great way to think about these things is due to Jon Skeet; I shall shamelessly steal it from him.

Imagine a stage upon which there is a line of people. They are wearing shirts that say GetLines, Split, Where, GroupBy, OrderByDescending, Take, ToList and ForEach.

ToList pokes Take. Take does something and then hands ToList a card with a list of words on it. ToList keeps on poking Take until Take says "I'm done". At that point, ToList makes a list out of all the cards it has been handed, and then hands the first one off to ForEach. The next time it is poked, it hands out the next card.

What does Take do? Every time it is poked it asks OrderByDescending for another card, and immediately hands that card to ToList. After handing out ten cards, it tells ToList "I'm done".

What does OrderByDescending do? When it is poked for the first time, it pokes GroupBy. GroupBy hands it a card. It keeps on poking GroupBy until GroupBy says "I'm done". Then OrderByDescending sorts the cards, and hands the first one to Take. Every subsequent time it is poked, it hands a new card to Take, until Take stops asking.


And so on. You see how this goes. The query operators GetLines, Split, Where, GroupBy, OrderByDescending and Take are lazy, in that they do not act until poked. Some of them (OrderByDescending, ToList, GroupBy) need to poke their card provider many, many times before they can respond to the guy poking them. Some of them (GetLines, Split, Where, Take) only poke their provider once when they are themselves poked.

Once ToList is done, ForEach pokes ToList. ToList hands ForEach a card off its list. ForEach counts the words, and then writes a word and a count on the whiteboard. ForEach keeps on poking ToList until ToList says "no more".

(Notice that the ToList is completely unnecessary in your query; all it does is accumulate the results of the top ten into a list. ForEach could be talking directly to Take.)
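If you want to observe the poking for yourself, one approach is to thread a pass-through "indicator" operator into the query and watch when elements are actually pulled. Here's a sketch of such a helper (the name Trace is made up; like any extension method, it has to live in a static class):

    static IEnumerable<T> Trace<T>(this IEnumerable<T> source, string name)
    {
        foreach (T item in source)
        {
            Console.WriteLine("{0} yielded: {1}", name, item);  // runs only when poked
            yield return item;
        }
    }

Drop a .Trace("after Split") between any two operators and you'll see output only as each element is demanded, never all at once.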

Now, as for your question of whether you can reduce the memory footprint further: yes, you can. Suppose the file is "foo bar foo blah". Your code builds up the set of groups:

    {
        { key: foo, contents: { foo, foo } },
        { key: bar, contents: { bar } },
        { key: blah, contents: { blah } }
    }

and then orders those by the length of the contents list, and then takes the top ten. You don't have to store nearly that much in the contents list in order to compute the answer you want. What you really want to be storing is:

    {
        { key: foo, value: 2 },
        { key: bar, value: 1 },
        { key: blah, value: 1 }
    }

and then sort that by value.

Or, alternatively, you could build up the backwards mapping

    {
        { key: 2, value: { foo } },
        { key: 1, value: { bar, blah } }
    }

sort that by key, and then do a select-many on the lists until you have extracted the top ten words.
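Here's a sketch of both shapes in code, assuming words is the lazy word sequence from earlier. The dictionary is the forward mapping (word to count); the lookup is the backwards one (count to words):

    // Forward mapping: word -> count. Only one counter per distinct word is stored.
    var counts = new Dictionary<string, int>();
    foreach (string word in words)
    {
        int n;
        counts.TryGetValue(word, out n);
        counts[word] = n + 1;
    }

    // Backwards mapping: count -> words; sort by count, flatten, take the top ten.
    var topTen = counts
        .ToLookup(kvp => kvp.Value, kvp => kvp.Key)
        .OrderByDescending(g => g.Key)
        .SelectMany(g => g)
        .Take(10);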

The concept you want to look at in order to do either of these is the "accumulator". An accumulator is an object which efficiently "accumulates" information about a data structure while the data structure is being iterated over. "Sum" is an accumulator of a sequence of numbers. "StringBuilder" is often used as an accumulator on a sequence of strings. You could write an accumulator which accumulates counts of words as the list of words is walked over.

The function you want to study in order to understand how to do this is Aggregate:

http://msdn.microsoft.com/en-us/library/system.linq.enumerable.aggregate.aspx
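For example, the dictionary accumulation above collapses into a single Aggregate call over the lazy word sequence -- again, just a sketch:

    var counts = words.Aggregate(
        new Dictionary<string, int>(),
        (acc, word) =>
        {
            int n;
            acc.TryGetValue(word, out n);
            acc[word] = n + 1;           // the dictionary accumulates a count per word
            return acc;
        });

    foreach (var pair in counts.OrderByDescending(kvp => kvp.Value).Take(10))
        Console.WriteLine("{0}\t{1}", pair.Key, pair.Value);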

Good luck!

Eric Lippert
I guess I was looking for more of an optimization about how you could find the top-used words in a very large file -- while not having to load the entire file into memory. I suppose you'd need a special data structure (like a trie) to do this, but just wanted to see if anyone had a creative solution. Yeah, it's not pure but I'm not too worried about the side effect in this little example.
jsw
I'm not understanding your comment, @jsw. Isn't that exactly what we're talking about? In the sketch in my answer you get one line at a time, split it, pull out the words, and so on. When is the entire file ever in memory? Remember, everything is computed *lazily* when possible.
Eric Lippert
At first glance I thought that this solution would require that all the "words" of the file be in memory (i.e., duplicate words in memory -- not the entire file, but all of it without whitespace) for it to work, but I didn't seriously consider where the laziness would end. Is the only way to test this by inserting "indicator" method calls/breakpoints to see how it's executing? Or how would you approach testing its behavior?
jsw
@jsw, I've updated my answer with some thoughts on your questions.
Eric Lippert
+1  A: 

First, let's abstract away our file into an IEnumerable<string> where the lines are yielded one at a time:

    class LineReader : IEnumerable<string> {
        Func<TextReader> _source;
        public LineReader(Func<Stream> streamSource) {
            _source = () => new StreamReader(streamSource());
        }

        public IEnumerator<string> GetEnumerator() {
            using (var reader = _source()) {
                string line;
                while ((line = reader.ReadLine()) != null) {
                    yield return line;
                }
            }
        }

        IEnumerator IEnumerable.GetEnumerator() {
            return GetEnumerator();
        }
    }

Next, let's make an extension method on IEnumerable<string> that will yield the words in each line:

    static class IEnumerableStringExtensions {
        public static IEnumerable<string> GetWords(this IEnumerable<string> lines) {
            foreach (string line in lines) {
                foreach (string word in line.Split(' ')) {
                    yield return word;
                }
            }
        }
    }

Then:

    var lr = new LineReader(() => new FileStream("C:/test.txt", FileMode.Open));
    var dict = lr.GetWords()
                 .GroupBy(w => w)
                 .ToDictionary(w => w.Key, w => w.Count());
    foreach (var pair in dict.OrderByDescending(kvp => kvp.Value).Take(10)) {
        Console.WriteLine("{0}: {1}", pair.Key, pair.Value);
    }
Jason