I'm reading a binary file into a BindingList(Of T) to be bound to a DataGridView.
Each line in the file represents a single transaction, but I need to consolidate and/or filter transactions that meet certain criteria.

I know how to do this from a mechanical standpoint (looping over the list and, for each item, either adding it as a new item or merging its data into an existing one), but I'm looking for a practice, pattern, existing component, or something else that I'm missing (I'm drawing a blank on keywords to search for).

I don't want to reinvent the wheel if I don't have to. I'm particularly concerned with speed and performance, since some runs will have 100k-plus records to process.

Currently working with .NET 2.0, but will move to 3.5 if a particularly sexy solution exists.


Update: I've changed the solution to 3.5, so that's no longer an issue. I should have pointed out that this project is VB.NET, but I may add a new C# library for this particular function to take advantage of C# iterators.

+1  A: 

Yes, you want 3.5 because this gives you LINQ -- language-integrated query.

There is a slight performance cost, but for huge recordsets you can offset this by using PLINQ (parallel processing).
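As a rough sketch of the shape that takes (assuming source is an in-memory IEnumerable<string> and that PLINQ is available to you; it shipped after 3.5, so treat this as illustrative):

var bigOnes = source.AsParallel()              // opt the query into parallel execution
                    .Where(s => s.Length > 24) // the filter now runs across cores
                    .ToList();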

LINQ is a declarative, functional way to deal with sets.

Concepts you'll need:
- lambda expressions, e.g. person => person.Age (used throughout the examples below)
- extension methods (see the short sketch below)
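
If you have never seen an extension method, it is just a static method in a static class whose first parameter is marked with this, which is what lets it chain with dot syntax on any sequence; the LINQ operators below are built exactly this way. A minimal, made-up example (names are hypothetical; it needs using System.Collections.Generic):

public static class SequenceExtensions
{
    // Callable as someStrings.LongerThan(24) on any IEnumerable<string>.
    public static IEnumerable<string> LongerThan(this IEnumerable<string> source, int length)
    {
        foreach (string s in source)
            if (s.Length > length)
                yield return s;   // also an iterator: results stream out lazily
    }
}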

Consider a set of 10,000 strings from which you want the first 100 that are longer than 24 characters:

var result = source.Where(s => s.Length > 24).Take(100);

From a set of Person objects you want to return the names, but they are divided into firstName and lastName properties:

var result = source.Select(person => person.firstName + person.lastName);

This returns IEnumerable<string>.

From the same set you want the average age:

var result = source.Average(person => person.Age);

Youngest 10 people:

var result = source.OrderBy(person => person.Age).Take(10);

Everybody, grouped by the first letter of their last names:

var result = source.GroupBy(person => person.lastName[0]);

This returns IEnumerable<IGrouping<char, Person>>.
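
Each element of that sequence is one group, i.e. a key plus everyone who shares it; for example:

foreach (IGrouping<char, Person> grp in result)
{
    char letter = grp.Key;       // the first letter of the last name
    int howMany = grp.Count();   // each group is itself a sequence of Person
}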

Names of the oldest 25 people whose last name starts with S:

var result = source.Where(person => person.lastName.StartsWith("S"))
   .OrderByDescending(person => person.Age)
   .Take(25)
   .Select(person => person.firstName + person.lastName);

Just imagine how much code you'd have to write in a foreach loop to accomplish this, and how much room there would be to introduce defects or miss optimizations along the way. The declarative nature of the LINQ syntax makes it easier to read and maintain.
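
For a sense of scale, here is roughly what that last query looks like written by hand against the same Person members (a sketch, pre-LINQ style):

List<Person> matches = new List<Person>();
foreach (Person person in source)
{
    if (person.lastName.StartsWith("S"))
        matches.Add(person);
}

// Sort descending by age, then project the first 25 names manually.
matches.Sort(delegate(Person a, Person b) { return b.Age.CompareTo(a.Age); });

List<string> names = new List<string>();
for (int i = 0; i < matches.Count && i < 25; i++)
    names.Add(matches[i].firstName + matches[i].lastName);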

There is an alternate syntax that is sort of SQL-ish, but shows how you are really defining queries against an arbitrary set of objects. Consider that you want to get people whose first name is "Bob":

var result = 
    from person in source
    where person.firstName == "Bob"
    select person;

It looks bizarre, but this is valid C# code if you jump up from 2.0.

My only warning is that once you work with LINQ you may refuse ever to work in 2.0 again.

There are lots of great resources available for learning LINQ syntax -- it doesn't take long.


Update

Additional considerations in response to the first comment:

You already have a very powerful tool at your disposal with C# 2.0 -- iterators.

Consider:

public class Foo
{
    private IEnumerable<Record> GetRecords()
    {
        while (/* there is more data to read */)
        {
            Record record = // do I/O stuff, instantiate a record

            // Hand this record back immediately; execution resumes here
            // the next time the caller asks for another item.
            yield return record;
        }
    }

    public void DisplayRecords()
    {
        foreach (Record record in GetRecords())
        {
            // do something meaningful

            // display the record
        }
    }
}

So, what is remarkable about this? The GetRecords() method is an iterator block, and the yield keyword returns results as requested ("lazy evaluation").

This means that when you call DisplayRecords(), it will call GetRecords(), but as soon as GetRecords() has a Record, it will return it to DisplayRecords(), which can then do something useful with it. When the foreach block loops again, execution will return to GetRecords(), which will return the next item, and so on.

In this way, you don't have to wait for 100,000 records to be read from disk before you can start sorting and displaying results.
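
Because the standard LINQ operators are lazy too, they compose directly with an iterator like GetRecords(); Where streams item by item, though OrderBy has to see the whole sequence before it yields anything. A sketch (the Amount property is made up):

foreach (Record record in GetRecords().Where(r => r.Amount > 100m))
{
    // Each record is read from disk, tested, and handled one at a time;
    // nothing is buffered up front.
}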

This gives some interesting possibilities; whether or not this can be made useful in your situation is up to you (you wouldn't want to refresh the grid binding 100,000 times, for example).

Jay
I had considered LINQ, but I'd have to load hundreds of thousands of records into the list (from hundreds of files) before being able to filter. Ideally I'd like to evaluate each item as I read it from disk. I also don't think that LINQ will allow me to consolidate like items, incrementing a Quantity field, but I could be wrong.
Robert Lee
@Robert I added another thought on the matter. LINQ itself would not be able to consolidate like items, but you can use it to easily group those items, and then you could call a `Consolidate()` [extension?] method to combine them.
Jay
There are LINQ-like libraries for 2.0 since all you need are iterators and anonymous delegates. All you lose by not moving to 3.5 is the sweet, sweet syntactic sugar.
Gabe
A: 

It sounds like you want to do something like this in pseudo-LINQ: data.GroupBy().Select(Count()).Where() -- you group (consolidate) by some criteria, count the number in each group, and then filter by the results.
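
Spelled out with real operators, that pipeline might look something like this (the AccountNumber field is just an illustration):

var result = data.GroupBy(t => t.AccountNumber)                    // consolidate on some key
                 .Select(g => new { g.Key, Count = g.Count() })    // key plus how many collapsed into it
                 .Where(x => x.Count > 1);                         // keep only the groups you care about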

However, you suggest that you may have too much data to load into memory all at once, so you want to consolidate as you load the data. This can be accomplished with your own GroupByCount operator, somewhat like this over-simplified version:

using System.Collections.Generic;

public static class GroupingExtensions
{
    // The "this" modifier makes it an extension method, so it can be
    // chained as data.GroupByCount() like the built-in operators.
    public static IEnumerable<KeyValuePair<T, int>>
        GroupByCount<T>(this IEnumerable<T> input)
    {
        // Consolidation happens as the input is enumerated: only the
        // dictionary of distinct items (with counts) is held in memory.
        Dictionary<T, int> counts = new Dictionary<T, int>();
        foreach (T item in input)
            if (counts.ContainsKey(item))
                counts[item]++;
            else
                counts[item] = 1;
        return counts;
    }
}

Then you would just have data.GroupByCount().Where() and your data would all be consolidated as it loads because the foreach would only load the next item after processing the previous one.
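
Usage might look like this, with a hypothetical ReadTransactions iterator that yields records straight off the disk (the file name is made up):

// ReadTransactions yields one record at a time, and GroupByCount consolidates
// them as it enumerates, so the whole file never sits in memory as a flat list.
var interesting = ReadTransactions("trades.dat")
    .GroupByCount()
    .Where(pair => pair.Value > 1);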

Gabe
This is conceptually similar to what I have implemented at this point, but since the user may want to group by multiple fields in the object, I have to compare the grouped fields and only consolidate when all of them match. If they do match, I also have to merge some fields ($ amounts, etc.) in the objects. Count is also a field, but it is only incremented as items merge.
Robert Lee
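
For the multi-field consolidation Robert describes, one option in 3.5 is an anonymous type as a composite key, with the merged fields computed per group; all of the Transaction member names below are made up for illustration:

var consolidated = transactions
    .GroupBy(t => new { t.Symbol, t.TradeDate })   // consolidate only when every grouped field matches
    .Select(g => new Transaction
    {
        Symbol    = g.Key.Symbol,
        TradeDate = g.Key.TradeDate,
        Amount    = g.Sum(t => t.Amount),           // merge the dollar fields
        Quantity  = g.Sum(t => t.Quantity),
        Count     = g.Count()                       // how many source rows merged into this one
    });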