After much Google searching and code experimentation, I'm stumped on a complex C# LINQ-to-objects problem which in SQL would be easy to solve with a pair of ROW_NUMBER()...PARTITION BY functions and a subquery or two.
Here's, in words, what I'm trying to do in code-- the underlying requirement is removing duplicate documents from a list:
- First, group a list by (Document.Title, Document.SourceId), assuming a (simplified) class definition like this:
class Document { string Title; int SourceId; // sources are prioritized (ID=1 better than ID=2) }
-
Within that group, assign each document an index (e.g. Index 0 == 1st document with this title from this source, Index 1 = 2nd document with this title from this source, etc.). I'd love the equivalent of ROW_NUMBER() in SQL!
- Now group by (Document.Title, Index), where Index was computed in Step #2. For each group, return only one document: the one with the lowest Document.SourceId.
Step #1 is easy (e.g. codepronet.blogspot.com/2009/01/group-by-in-linq.html), but I'm getting stumped on steps #2 and #3. I can't seem to build a red-squiggle-free C# LINQ query to solve all three steps.
Anders Heilsberg's post on this thread is I think the answer to Steps #2 and #3 above if I could get the syntax right.
I'd prefer to avoid using an external local variable to do the Index computation, as recommended on slodge.blogspot.com/2009/01/adding-row-number-using-linq-to-objects.html, since that solution breaks if the external variable is modified.
Optimally, the group-by-Title step could be done first, so the "inner" groupings (first by Source to compute the index, then by Index to filter out duplicates) can operate on small numbers of objects in each "by title" group, since the # of documents in each by-title group is usually under 100. I really don't want an N2 solution!
I could certainly solve this with nested foreach loops, but it seems like the kind of problem which should be simple with LINQ.
Any ideas?