views:

674

answers:

2

I have a method that performs a simplistic 'grep' across files, using an enumerable of "search strings". (Effectively, I'm doing a very naive "Find All References")

IEnumerable<string> searchStrings = GetSearchStrings();
IEnumerable<string> filesToLookIn = GetFiles();
MultiMap<string, string> references = new MultiMap<string, string>();

foreach( string fileName in filesToLookIn )
{
    foreach( string line in File.ReadAllLines( fileName ) )
    {
        foreach( string searchString in searchStrings )
        {
            if( line.Contains( searchString ) )
            {
                references.AddIfNew( searchString, fileName );
            }
        }
    }
}

Note: MultiMap<TKey,TValue> is roughly the same as Dictionary<TKey,List<TValue>>, just avoiding the NullReferenceExceptions you'd normally encounter.


I have been trying to put this into a more "functional" style, using chained LINQ extension methods but haven't figured it out.

One dead-end attempt:

// I get lost on how to do a loop within a loop here...
// plus, I lose track of the file name
var lines = filesToLookIn.Select( f => File.ReadAllLines( f ) ).Where( // ???

And another (hopefully preserving the file name this time):

var filesWithLines =
    filesToLookIn
        .Select(f => new { FileName = f, Lines = File.ReadAllLines(f) });

var matchingSearchStrings =
    searchStrings
        .Where(ss => filesWithLines.Any(
                         fwl => fwl.Lines.Any(l => l.Contains(ss))));

But I still seem to lose the information I need.

Maybe I'm just approaching this from the wrong angle? From a performance standpoint, the loops ought to perform in roughly the same order as the original example.

Any ideas of how to do this in a more compact functional representation?

+1  A: 

I would use the FindFile (FindFirstFileEx, FindNextFile, etc, etc) API calls to look in the file for the term that you are searching on. It will probably do it faster than you reading line-by-line.

However, if that won't work for you, you should consider creating an IEnumerable<String> implementation which will read the lines from the file and yield them as they are read (instead of reading them all into an array). Then, you can query on each string, and only get the next one if it is needed.

This should save you a lot of time.

Note that in .NET 4.0, a lot of the IO apis that return lines from files (or search files) will return IEnumerable implementations which do exactly what is mentioned above, in that it will search directories/files and yield them when appropriate instead of front-loading all the results.

casperOne
+8  A: 

How about:

var matches =
    from fileName in filesToLookIn
    from line in File.ReadAllLines(fileName)
    from searchString in searchStrings
    where line.Contains(searchString)
    select new
    {
        FileName = fileName,
        SearchString = searchString
    };

    foreach(var match in matches)
    {
        references.AddIfNew(match.SearchString, match.FileName);
    }

Edit:

Conceptually, the query turns each file name into a set of lines, then cross-joins that set of lines to the set of search strings (meaning each line is paired with each search string). That set is filtered to matching lines, and the relevant information for each line is selected.

The multiple from clauses are similar to nested foreach statements. Each indicates a new iteration in the scope of the previous one. Multiple from clauses translate into the SelectMany method, which selects a sequence from each element and flattens the resulting sequences into one sequence.

All of C#'s query syntax translates to extension methods. However, the compiler does employ some tricks. One is the use of anonymous types. Whenever 2+ range variables are in the same scope, they are probably part of an anonymous type behind the scenes. This allows arbitrary amounts of scoped data to flow through extension methods like Select and Where, which have fixed numbers of arguments. See this post for further details.

Here is the extension method translation of the above query:

var matches = filesToLookIn
    .SelectMany(
        fileName => File.ReadAllLines(fileName),
        (fileName, line) => new { fileName, line })
    .SelectMany(
        anon1 => searchStrings,
        (anon1, searchString) => new { anon1, searchString })
    .Where(anon2 => anon2.anon1.line.Contains(anon2.searchString))
    .Select(anon2 => new
    {
        FileName = anon2.anon1.fileName,
        SearchString = anon2.searchString
    });
Bryan Watts
I wasn't aware you could use multiple "from" statements like that. How does that actually work?My experience with LINQ has purely been through lambdas and extension methods. Does this even translate to chained extension methods?
Jonathan Mitchem
Yes, multiple from clauses translate into calls to the SelectMany extension method. Check it out in Reflector to see exactly what's happening.
dahlbyk
@jmitchem: I edited my answer to address your questions.
Bryan Watts
This does what I'm looking for, and expresses the idea much better. Thanks for the tip to SelectMany, didn't realize that was there. However, I am experiencing a noticeable performance difference: the nested 'ForEach'es take 229sec to execute; the LINQ nested 'from's take 504sec. Is this due to anonymous object creation/destruction behind the scene? Should I create a new SO question to address this issue?
Jonathan Mitchem
It could be. Opening a question couldn't hurt.
Bryan Watts