ansaurus

Question

LINQ-to-objects index within a group + for different groupings (aka ROW_NUMBER with PARTITION BY equivalent)

Answer 1

+2 A:

To be honest, I'm quite confused by your question. Maybe you should explain what you're trying to solve. Anyway, I'll try to answer what I understood.

1) First, I'll assume that you already have a list of documents grouped by Title+SourceId. For testing purposes, I hardcoded a list as follows:

var docs = new [] {
    new { Title = "ABC", SourceId = 0 },
    new { Title = "ABC", SourceId = 4 },
    new { Title = "ABC", SourceId = 2 },
    new { Title = "123", SourceId = 7 },
    new { Title = "123", SourceId = 5 },
};

2) To attach an index to every item, you can use the overload of the Select extension method whose selector function also receives the element's position (a Func<TSource, int, TResult>). Like this:

var docsWithIndex
    = docs
    .Select( (d, i) => new { Doc = d, Index = i } );

3) From what I understood, the next step would be to group the last result by Title. Here's how to do it:

var docsGroupedByTitle
    = docsWithIndex
    .GroupBy( a => a.Doc.Title );

The GroupBy function (used above) returns an IEnumerable<IGrouping<string,DocumentWithIndex>>. Since a group is enumerable too, we now have an enumerable of enumerables.
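To make the "enumerable of enumerables" shape concrete, here is a minimal sketch that walks the groups with two nested loops' worth of work. The class and method names (GroupingDemo, Describe) are made up for illustration; the data is the hardcoded list from step 1.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class GroupingDemo
{
    // Illustrative helper: returns one line per group, made of the group key
    // followed by the original (pre-grouping) indexes of its members.
    public static List<string> Describe()
    {
        var docs = new[] {
            new { Title = "ABC", SourceId = 0 },
            new { Title = "ABC", SourceId = 4 },
            new { Title = "ABC", SourceId = 2 },
            new { Title = "123", SourceId = 7 },
            new { Title = "123", SourceId = 5 },
        };

        var groups = docs
            .Select((d, i) => new { Doc = d, Index = i })
            .GroupBy(a => a.Doc.Title);

        var lines = new List<string>();
        foreach (var g in groups) // outer enumerable: one IGrouping per Title
        {
            // inner enumerable: the items that share this group's key
            var indexes = g.Select(a => a.Index.ToString()).ToArray();
            lines.Add(g.Key + ": " + string.Join(",", indexes));
        }
        return lines;
    }

    static void Main()
    {
        foreach (var line in Describe()) Console.WriteLine(line);
    }
}
```

With this input the two lines are "ABC: 0,1,2" and "123: 3,4". Note that the index was attached before grouping, so it is a position in the whole list, not a position within the group.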

4) Now, for each of the groups above, we'll keep only the item with the minimum SourceId. This operation needs two levels of nesting (not recursion in the strict sense). In LINQ, the outer level is a selection (for each group, get one of its items), and the inner level is an aggregation (get the item with the lowest SourceId):

var selectedFew
    = docsGroupedByTitle
    .Select(
     g => g.Aggregate(
      (a, b) => (a.Doc.SourceId  <= b.Doc.SourceId) ? a : b
     )
    );

Just to ensure that it works, I tested it with a simple foreach:

foreach (var a in selectedFew) Console.WriteLine(a);
//The result will be:
//{ Doc = { Title = ABC, SourceId = 0 }, Index = 0 }
//{ Doc = { Title = 123, SourceId = 5 }, Index = 4 }

I'm not sure that's what you wanted. If not, please comment on the answer and I can fix it. I hope this helps.

Note: All the classes used in my tests were anonymous. So, you don't really need to define a DocumentWithIndex type. Actually, I haven't even declared a Document class.

jpbochi 2009-07-26 01:04:53

Hi jpbochi - dahlbyk's solution was a correct one. Sorry I wasn't able to get back to you sooner to clarify, this was my first question on Stack Overflow and I never expected to get 2 answers in less than 2 hours on a Sunday! Next time I'll check back faster! :-) Anyway, thanks for the help.

Justin Grant 2009-07-26 03:53:31

No problemo. I guess you should mark his answer as accepted then.

jpbochi 2009-07-26 05:25:43

Answer 2

+4 A:

I think jpbochi missed that you want your groupings to be by pairs of values (Title+SourceId then Title+Index). Here's a LINQ query (mostly) solution:

var selectedFew = 
    from doc in docs
    group doc by new { doc.Title, doc.SourceId } into g
    from docIndex in g.Select((d, i) => new { Doc = d, Index = i })
    group docIndex by new { docIndex.Doc.Title, docIndex.Index } into g
    select g.Aggregate((a,b) => (a.Doc.SourceId <= b.Doc.SourceId) ? a : b);

First we group by Title+SourceId (I use an anonymous type because the compiler builds a good hashcode for the grouping lookup). Then we use Select to attach the grouped index to the document, which we use in our second grouping. Finally, for each group we pick the lowest SourceId.

Given this input:

var docs = new[] {
    new { Title = "ABC", SourceId = 0 },
    new { Title = "ABC", SourceId = 4 },
    new { Title = "ABC", SourceId = 2 },
    new { Title = "123", SourceId = 7 },
    new { Title = "123", SourceId = 7 },
    new { Title = "123", SourceId = 7 },
    new { Title = "123", SourceId = 5 },
    new { Title = "123", SourceId = 5 },
};

I get this output:

{ Doc = { Title = ABC, SourceId = 0 }, Index = 0 }
{ Doc = { Title = 123, SourceId = 5 }, Index = 0 }
{ Doc = { Title = 123, SourceId = 5 }, Index = 1 }
{ Doc = { Title = 123, SourceId = 7 }, Index = 2 }
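The "index within a group" step is the part that corresponds to ROW_NUMBER with PARTITION BY. As a rough sketch, it can be pulled out into a reusable extension method. WithRowNumber and RowNumberSketch are hypothetical names, not part of the BCL. Unlike SQL, the numbering here is zero-based and simply follows source order, since there is no ORDER BY.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class RowNumberSketch
{
    // Hypothetical helper (not a framework method): pairs each element with its
    // zero-based position inside its partition. Key = position, Value = element.
    public static IEnumerable<KeyValuePair<int, T>> WithRowNumber<T, TKey>(
        this IEnumerable<T> source, Func<T, TKey> partitionBy)
    {
        return source
            .GroupBy(partitionBy)                       // PARTITION BY
            .SelectMany(g => g.Select(                  // number items within each partition
                (item, i) => new KeyValuePair<int, T>(i, item)));
    }

    static void Main()
    {
        var docs = new[] {
            new { Title = "ABC", SourceId = 0 },
            new { Title = "ABC", SourceId = 4 },
            new { Title = "ABC", SourceId = 2 },
            new { Title = "123", SourceId = 7 },
            new { Title = "123", SourceId = 7 },
            new { Title = "123", SourceId = 7 },
            new { Title = "123", SourceId = 5 },
            new { Title = "123", SourceId = 5 },
        };

        // Partition by the Title+SourceId pair, as in the first grouping above.
        foreach (var pair in docs.WithRowNumber(d => new { d.Title, d.SourceId }))
            Console.WriteLine(pair.Key + " " + pair.Value);
    }
}
```

For the input above, the positions come out as 0, 0, 0, 0, 1, 2, 0, 1: each distinct ABC document is alone in its partition, the three (123, 7) documents are numbered 0 to 2, and the two (123, 5) documents 0 to 1.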

Update: I just saw your question about grouping by Title first. You can do this using a subquery on your Title groups:

var selectedFew =
    from doc in docs
    group doc by doc.Title into titleGroup
    from docWithIndex in
        (
            from doc in titleGroup
            group doc by doc.SourceId into idGroup
            from docIndex in idGroup.Select((d, i) => new { Doc = d, Index = i })
            group docIndex by docIndex.Index into indexGroup
            select indexGroup.Aggregate((a,b) => (a.Doc.SourceId <= b.Doc.SourceId) ? a : b)
        )
    select docWithIndex;

dahlbyk 2009-07-26 02:40:45

Hey DahlbyK - this is great! Your solution looks good. Now I don't feel so bad about being unable to figure it out myself the first time. I discovered the Select-with-index overload but couldn't figure out how to get it into a LINQ query. Some black-belt code on your end, thanks for the help and the education in what's possible.

Justin Grant 2009-07-26 03:48:30

Answer 3

+1 A:

Method Based Syntax:

var selectedFew = docs
    .GroupBy(doc => new { doc.Title, doc.SourceId }, doc => doc)
    .SelectMany(grouping => grouping.Select((doc, index) => new { doc, index }))
    .GroupBy(anon => new { anon.doc.Title, anon.index })
    .Select(grouping => grouping.Aggregate((a, b) => a.doc.SourceId <= b.doc.SourceId ? a : b));

Would you say the above is the equivalent Method based syntax?

Dog Ears 2009-07-27 23:04:24

Yep, this emits the same (correct) results as DahlbyK's LINQ-y syntax above. Although (see Dahlby's updated query) it's probably more efficient to group by Title first so any sorting/aggregating can happen on tiny sets-- if there were a billion documents it'd make a big difference since you wouldn't have to load all of them into RAM at once. Plus, most titles won't have any duplicates at all... I hope the BCL optimized sorting and group-by operations on one-member sets. :-)

Justin Grant 2009-07-29 14:27:29
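For comparison, the Title-first variant from dahlbyk's updated query can also be written in method syntax. This is a sketch of what appears to be an equivalent form, not a translation confirmed by either author. The TitleFirstSketch class, its Run method, and the "Title/SourceId/Index" output format are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class TitleFirstSketch
{
    // Illustrative wrapper: returns each selected item as "Title/SourceId/Index".
    public static List<string> Run()
    {
        var docs = new[] {
            new { Title = "ABC", SourceId = 0 },
            new { Title = "ABC", SourceId = 4 },
            new { Title = "ABC", SourceId = 2 },
            new { Title = "123", SourceId = 7 },
            new { Title = "123", SourceId = 7 },
            new { Title = "123", SourceId = 7 },
            new { Title = "123", SourceId = 5 },
            new { Title = "123", SourceId = 5 },
        };

        var selectedFew = docs
            .GroupBy(doc => doc.Title)                  // outer grouping: Title
            .SelectMany(titleGroup => titleGroup
                .GroupBy(doc => doc.SourceId)           // within a title: SourceId
                .SelectMany(idGroup =>
                    idGroup.Select((d, i) => new { Doc = d, Index = i }))
                .GroupBy(x => x.Index)                  // within a title: index
                .Select(indexGroup => indexGroup.Aggregate(
                    (a, b) => a.Doc.SourceId <= b.Doc.SourceId ? a : b)));

        return selectedFew
            .Select(x => x.Doc.Title + "/" + x.Doc.SourceId + "/" + x.Index)
            .ToList();
    }

    static void Main()
    {
        foreach (var line in Run()) Console.WriteLine(line);
    }
}
```

With the eight-document input from Answer 2, this yields ABC/0/0, 123/5/0, 123/5/1 and 123/7/2, the same four items as the query-syntax versions. Because all inner work happens inside one title group at a time, the inner groupings only ever see the documents that share a title.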
