views:

223

answers:

6

I'm writing a program as follows:

  • Find all files with the correct extension in a given directory
  • Foreach, find all occurrences of a given string in those files
  • Print each line

I'd like to write this in a functional way, as a series of generator functions (things that call yield return and only return one item at a time lazily-loaded), so my code would read like this:

IEnumerable<string> allFiles = GetAllFiles();
IEnumerable<string> matchingFiles = GetMatches( "*.txt", allFiles );
IEnumerable<string> contents = GetFileContents( matchingFiles );
IEnumerable<string> matchingLines = GetMatchingLines( contents );

foreach( var lineText in matchingLines )
  Console.WriteLine( "Found: " + lineText );

This is all fine, but what I'd also like to do is print some statistics at the end. Something like this:

Found 233 matches in 150 matching files. Scanned 3,297 total files in 5.72s

The problem is, writing the code in a 'pure functional' style like above, each item is lazily loaded.
You only know how many files match in total until the final foreach loop completes, and because only one item is ever yielded at a time, the code doesn't have any place to keep track of how many things it's found previously. If you invoke LINQ's matchingLines.Count() method, it will re-enumerate the collection!

I can think of many ways to solve this problem, but all of them seem to be somewhat ugly. It strikes me as something that people are bound to have done before, and I'm sure there'll be a nice design pattern which shows a best practice way of doing this.

Any ideas? Cheers

A: 

Assuming those functions are your own, the only thing I can think of is the Visitor pattern, passing in an abstract visitor function that calls you back when each thing happens. For example: pass an ILineVisitor into GetFileContents (which I'm assuming breaks up the file into lines). ILineVisitor would have a method like OnVisitLine(String line), you could then implement the ILineVisitor and make it keep the appropriate stats. Rinse and repeat with a ILineMatchVisitor, IFileVisitor etc. Or you could use a single IVisitor with an OnVisit() method which has a different semantic in each case.

Your functions would each need to take a Visitor, and call it's OnVisit() at the appropriate time, which may seem annoying, but at least the visitor could be used to do lots of interesting things, other than just what you're doing here. In fact you could actually avoid writing GetMatchingLines by passing a visitor that checks for the match in OnVisitLine(String line) into GetFileContents.

Is this one of the ugly things you'd already considered?

Jesse Pepper
I hadn't considered the visitor pattern because yield return obsoletes it... well at least that's what I'd thought. Thanks for the food for thought :-)
Orion Edwards
I can't see why it would? The lazy evaluation shouldn't stop you from being able to call a visit function, surely?
Jesse Pepper
+2  A: 

I would say that you need to encapsulate the process into a 'Matcher' class in which your methods capture statistics as they progress.

public class Matcher
{
  private int totalFileCount;
  private int matchedCount;
  private DateTime start;
  private int lineCount;
  private DateTime stop;

  public IEnumerable<string> Match()
  {
     return GetMatchedFiles();
     System.Console.WriteLine(string.Format(
       "Found {0} matches in {1} matching files." + 
       " {2} total files scanned in {3}.", 
       lineCount, matchedCount, 
       totalFileCount, (stop-start).ToString());
  }

  private IEnumerable<File> GetMatchedFiles(string pattern)
  {
     foreach(File file in SomeFileRetrievalMethod())
     {
        totalFileCount++;
        if (MatchPattern(pattern,file.FileName))
        {
          matchedCount++;
          yield return file;
        }
     }
  }
}

I'll stop there since I'm supposed to be coding work stuff, but the general idea is there. The entire point of 'pure' functional program is to not have side effects, and this type of statics calculation is a side effect.

Steve Mitcham
+2  A: 

I can think of two ideas

  1. Pass in a context object and return (string + context) from your enumerators - the purely functional solution

  2. use thread local storage for you statistics (CallContext), you can be fancy and support a stack of contexts. so you would have code like this.

    using (var stats = DirStats.Create())
    {
        IEnumerable<string> allFiles = GetAllFiles();
        IEnumerable<string> matchingFiles = GetMatches( "*.txt", allFiles );
        IEnumerable<string> contents = GetFileContents( matchingFiles );
        stats.Print()
        IEnumerable<string> matchingLines = GetMatchingLines( contents );
        stats.Print();
    }
    
Sam Saffron
returning string + context seems like the best way to go thus far
Orion Edwards
+1  A: 

If you're happy to turn your code upside down, you might be interested in Push LINQ. The basic idea is to reverse the "pull" model of IEnumerable<T> and turn it into a "push" model with observers - each part of the pipeline effectively pushes its data past any number of observers (using event handlers) which typically form new parts of the pipeline. This gives a really easy way to hook up multiple aggregates to the same data.

See this blog entry for some more details. I gave a talk on it in London a while ago - my page of talks has a few links for sample code, the slide deck, video etc.

It's a fun little project, but it does take a bit of getting your head around.

Jon Skeet
+2  A: 

In a similar vein to other answers, but taking a slightly more generic approach ...

... why not create a Decorator class that can wrap an existing IEnumerable implementation and calculate the statistic as it passes other items through.

Here's a Counter class I just threw together - but you could create variations for other kinds of aggregation too.

public class Counter<T> : IEnumerable<T>
{
    public int Count { get; private set; }

    public Counter(IEnumerable<T> source)
    {
        mSource = source;
        Count = 0;
    }

    public IEnumerator<T> GetEnumerator()
    {
        foreach (var T in mSource)
        {
            Count++;
            yield return T;
        }
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        foreach (var T in mSource)
        {
            Count++;
            yield return T;
        }
    }

    private IEnumerable<T> mSource;
}

You could create three instances of Counter:

  1. One to wrap GetAllFiles() counting the total number of files;
  2. One to wrap GetMatches() counting the number of matching files; and
  3. One to wrap GetMatchingLines() counting the number of matching lines.

The key with this approach is that you're not layering multiple responsibilities onto your existing classes/methods - the GetMatchingLines() method only handles the matching, you're not asking it to track stats as well.

Clarification in response to a comment by Mitcham:

The final code would look something like this:

var files = new Counter<string>( GetAllFiles());
var matchingFiles = new Counter<string>(GetMatches( "*.txt", files ));
var contents = GetFileContents( matchingFiles );
var linesFound = new Counter<string>(GetMatchingLines( contents ));

foreach( var lineText in linesFound )
    Console.WriteLine( "Found: " + lineText );

string message 
    = String.Format( 
        "Found {0} matches in {1} matching files. Scanned {2} files",
        linesFound.Count,
        matchingFiles.Count,
        files.Count);
Console.WriteLine(message);

Note that this is still a functional approach - the variables used are immutable (more like bindings than variables), and the overall function has no side-effects.

Bevan
In the purely functional approach (I know my answer to this was not purely functional either), the 'Get' methods do not stick around, so how would you query the Count property after the call was completed?
Steve Mitcham
This is the solution I actually arrived at myself after some more offline thinking last night. Thanks :-)
Orion Edwards
@Mitcham - most functional languages have some kind of "let" statement that allows you to reference a single result multiple times. In this case, since we're using C#, I'd use a local variable to hold a reference to each counter instance long enough to get at the final value.
Bevan
A: 

I took Bevan's code and refactored it around until I was content. Fun stuff.

public class Counter
{
    public int Count { get; set; }
}

public static class CounterExtensions
{
    public static IEnumerable<T> ObserveCount<T>
      (this IEnumerable<T> source, Counter count)
    {
        foreach (T t in source)
        {
            count.Count++;
            yield return t;
        }
    }

    public static IEnumerable<T> ObserveCount<T>
      (this IEnumerable<T> source, IList<Counter> counters)
    {
        Counter c = new Counter();
        counters.Add(c);
        return source.ObserveCount(c);
    }
}


public static class CounterTest
{
    public static void Test1()
    {
        IList<Counter> counters = new List<Counter>();
  //
        IEnumerable<int> step1 =
            Enumerable.Range(0, 100).ObserveCount(counters);
  //
        IEnumerable<int> step2 =
            step1.Where(i => i % 10 == 0).ObserveCount(counters);
  //
        IEnumerable<int> step3 =
            step2.Take(3).ObserveCount(counters);
  //
        step3.ToList();
        foreach (Counter c in counters)
        {
            Console.WriteLine(c.Count);
        }
    }
}

Output as expected: 21, 3, 3

David B