I have a List<> of objects containing two strings and a DateTime. I want to build another list of the same objects containing only the last unique items using the two strings as keys and the last DateTime value. In SQL think the following:

SELECT col1, col2, MAX(datetime) FROM table GROUP BY col1, col2

This gives the unique list of col1, col2 and the last datetime. So I'm trying to do this in code with two lists: one with duplicates in it, which I parse to grab only the last unique items and populate a second list.

The data sets I have are huge, so just going through the duplicate list, checking whether each item is in the unique list, adding it if it's not and comparing the dates if it is, is pretty slow. So I thought I could recursively go through the duplicate list, grab the unique items, find their max datetime and delete the non-max ones as I loop through, making my duplicate list smaller and smaller and thus speeding things up. (I hope you're still following me.)

So anyway, I wrote a recursive loop with two lists, but when I loop through I get a System.StackOverflowException on about the 3000th iteration.

Here's my code. Imagine the ListWithDuplicates is full of data. The actual listDataItem has more properties that I've left out. But my main question is: why can't I loop through the public list in this manner without causing the StackOverflowException?

using System;
using System.Net;
using System.IO;
using System.Collections.Generic;
using System.Linq;

public class RecursionTest
{
    public List<listDataItem> ListWithDuplicates { get; set; }
    public List<listDataItem> ListWithUniques { get; set; }

    public RecursionTest()
    {
        Process();
    }

    public void Process()
    {
        int rowcount = 0;
        int duplicates = 0;
        int total = 0;
        RecursiveLoopForUnique(ref rowcount, ref duplicates, ref total, "", "");
    }

    private void RecursiveLoopForUnique(ref int rowcount, ref int duplicates, ref int total, string col1, string col2)
    {
        if (rowcount > 0)
            duplicates += ListWithDuplicates.RemoveAll(z => z.COL1 == col1 && z.COL2 == col2);
        if (ListWithDuplicates.Count > 0)
        {
            foreach (listDataItem item in ListWithDuplicates)
            {
                rowcount++;
                if (ListWithUniques.FindAll(z => z.COL1 == item.COL1 && z.COL2 == item.COL2).Count < 1)
                {
                    ListWithUniques.Add(ListWithDuplicates.FindAll(z => z.COL1 == item.COL1 && z.COL2 == item.COL2).OrderByDescending(z => z.DATETIME).First());
                    col1 = item.COL1;
                    col2 = item.COL2;
                    break;
                }
            }
            RecursiveLoopForUnique(ref rowcount, ref duplicates, ref total, col1, col2);
        }
        else
            return;
    }

    public class listDataItem
    {
        public string COL1 { get; set; }
        public string COL2 { get; set; }
        public DateTime DATETIME { get; set; }            

        public listDataItem(string col1, string col2, DateTime datetime)
        {
            COL1 = col1;
            COL2 = col2;
            DATETIME = datetime;
        }
    }
}
+2  A: 

LINQ, yay.

// Max with a selector returns the selected value (a DateTime), not the item itself.
DateTime latestDateTime =
    ListWithDuplicates.Where(item => item.COL1 == yourCol1Param && item.COL2 == yourCol2Param)
                      .Max(item => item.DATETIME);

MSDN notes on..

Where: http://msdn.microsoft.com/en-us/library/bb534803.aspx

Max: http://msdn.microsoft.com/en-us/library/bb347632.aspx

OrderBy: http://msdn.microsoft.com/en-us/library/bb534966.aspx

Last: http://msdn.microsoft.com/en-us/library/bb358775.aspx

Jimmy Hoffa
You don't need to use OrderBy(...).Last(), you can use Max(item => item.DateTime) instead.
Juliet
I have to admit, I don't understand at all how this solves his problem. What makes you think he's looking for the date on one particular combination of C1 and C2, rather than all of them?
mquander
@mquander: I am kind of assuming that's what the poster wants from one of the things he said in his post. He said "containing only the last unique items using the two strings as keys", if you use the two strings as keys you get a group of items, which are only unique by their date, and then you filter that down to the latest date and you only have one item..
Jimmy Hoffa
@Juliet: Right! I always forget about the Min() and Max().. Thanks!
Jimmy Hoffa
Thanks for your help on this. I was surprised by how quickly you responded as well. Top form. I played around with the various group-by LINQ selects provided by yourself and others, but in the end I went with using a Dictionary and a for loop as suggested by @t_scho.
craigpj
Jimmy Hoffa, you were right. I was after the last unique item based on the two columns as unique keys. My listDataItem has other properties in it that are important to me as well, like an ID and some other data on each row/object. Cheers.
craigpj
A: 

I'm not sure about the syntax, but it should be close.

from d in DupsList
group d.DATETIME by new { d.COL1, d.COL2 } into grp
select new listDataItem(grp.Key.COL1, grp.Key.COL2, grp.Max());
James Curran
A: 

Well, if you have more than a few thousand unique pairs of C1, C2, then you'll encounter this, since you're recursing once for each unique group.
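To make the depth issue concrete: each call to RecursiveLoopForUnique handles one unique pair and then calls itself, so N unique pairs need N nested stack frames. Below is a minimal sketch of the same per-group work driven by a while loop, which keeps the stack depth constant. The `Item` class and `IterativeUnique.Latest` are hypothetical stand-ins, not the poster's actual types.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for listDataItem (assumption, not the original type).
public sealed class Item
{
    public string Col1 { get; }
    public string Col2 { get; }
    public DateTime When { get; }

    public Item(string col1, string col2, DateTime when)
    {
        Col1 = col1;
        Col2 = col2;
        When = when;
    }
}

public static class IterativeUnique
{
    // Same idea as the recursive version, but a while loop replaces the
    // self-call, so the stack does not grow with the number of unique pairs.
    public static List<Item> Latest(List<Item> source)
    {
        var remaining = new List<Item>(source); // work on a copy
        var uniques = new List<Item>();

        while (remaining.Count > 0)
        {
            Item first = remaining[0];

            // Most recent item that shares the first item's key pair.
            Item latest = remaining
                .Where(z => z.Col1 == first.Col1 && z.Col2 == first.Col2)
                .OrderByDescending(z => z.When)
                .First();
            uniques.Add(latest);

            // Drop the whole group so the next pass starts on a new pair.
            remaining.RemoveAll(z => z.Col1 == first.Col1 && z.Col2 == first.Col2);
        }

        return uniques;
    }
}
```

This only removes the stack growth. It still rescans the remaining list once per group, so it is not a fast solution for large inputs.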

There are a lot of ways you could fix this up; one that would wind up much clearer and faster would be to sort the list by C1 and C2, and then go down it exactly once to find the most recent date in each group. If you aren't wedded to reimplementing it yourself, the best way is this:

ListWithUniques = ListWithDuplicates
    .GroupBy(x => new { x.COL1, x.COL2 })
    .Select(g => g.OrderByDescending(x => x.DATETIME).First())
    .ToList();
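
For anyone who does want to implement it by hand, the sort-then-scan idea mentioned above might look roughly like this. It is a sketch only: `Item` and `SortAndScan.Latest` are hypothetical names, and ordinal string comparison is an assumption chosen to match the `==` comparisons in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for listDataItem (assumption, not the original type).
public sealed class Item
{
    public string Col1 { get; }
    public string Col2 { get; }
    public DateTime When { get; }

    public Item(string col1, string col2, DateTime when)
    {
        Col1 = col1;
        Col2 = col2;
        When = when;
    }
}

public static class SortAndScan
{
    public static List<Item> Latest(IEnumerable<Item> source)
    {
        // Sort by both keys, newest first within each key pair.
        var sorted = source
            .OrderBy(x => x.Col1, StringComparer.Ordinal)
            .ThenBy(x => x.Col2, StringComparer.Ordinal)
            .ThenByDescending(x => x.When)
            .ToList();

        var result = new List<Item>();
        Item previous = null;

        // Single pass: the first row of each (Col1, Col2) run is the newest one.
        foreach (var item in sorted)
        {
            bool startsNewGroup = previous == null
                || item.Col1 != previous.Col1
                || item.Col2 != previous.Col2;
            if (startsNewGroup)
            {
                result.Add(item);
            }
            previous = item;
        }

        return result;
    }
}
```

The sort dominates the cost at O(n log n); the scan itself touches each item once and involves no recursion.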
mquander
A: 

SELECT col1, col2, MAX(datetime) FROM table GROUP BY col1, col2

in LINQ:

var query = from row in table
            group row by new { row.Col1, row.Col2 } into g
            select new
            {
                Col1 = g.Key.Col1,
                Col2 = g.Key.Col2,
                Date = g.Max(b => b.Date)
            };

And in a potentially more useful form:

var dict = query.ToDictionary(a => new { a.Col1, a.Col2 }, a => a.Date);

Then you can reference it like so:

DateTime specificMaxDate = dict[new { Col1 = 2, Col2 = 3 }];
Ian Henry
thanks Ian for your time and suggestions on this. Much appreciated.
craigpj
+1  A: 

How about this:

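Dictionary<string, item> destDict = new Dictionary<string, item>();

foreach (item curr in items)
{
    // Composite key built from both columns.
    string key = curr.col1 + curr.col2;
    if (!destDict.ContainsKey(key))
    {
        // First time this pair is seen.
        destDict.Add(key, curr);
    }
    else
    {
        if (destDict[key].date < curr.date)
        {
            // Keep the stored item but move its date forward to the later value.
            destDict[key].date = curr.date;
        }
    }
}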

I tested this on a list containing 1000 each of 2 unique col1/col2 pairs. Worked fine and was faster than a LINQ groupby/select.
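
One caveat with a concatenated key is that different pairs can collide: ("ab", "c") and ("a", "bc") both produce "abc". A possible variation, sketched below, keys the dictionary on a tuple of the two strings and stores the whole later item. The `Item` class and `LatestByKey.Latest` are hypothetical names, and the tuple syntax assumes C# 7 or later.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for listDataItem (assumption, not the original type).
public sealed class Item
{
    public string Col1 { get; }
    public string Col2 { get; }
    public DateTime When { get; }

    public Item(string col1, string col2, DateTime when)
    {
        Col1 = col1;
        Col2 = col2;
        When = when;
    }
}

public static class LatestByKey
{
    public static Dictionary<(string, string), Item> Latest(IEnumerable<Item> source)
    {
        var latest = new Dictionary<(string, string), Item>();

        foreach (var item in source)
        {
            // Tuple key: ("ab", "c") and ("a", "bc") stay distinct.
            var key = (item.Col1, item.Col2);

            // One lookup per item; store the item if the key is new
            // or if this item is more recent than the stored one.
            if (!latest.TryGetValue(key, out var existing) || item.When > existing.When)
            {
                latest[key] = item;
            }
        }

        return latest;
    }
}
```

Calling `LatestByKey.Latest(items).Values` then gives the most recent item per pair, with all of its other properties intact.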

t_scho
Thanks to everyone who helped out on this one. The method provided by @t_scho worked well and it is blindingly fast. I altered it slightly by using DateTime.CompareTo and assigning the curr object to the dictionary when its DATETIME is later than that of the item currently in the destDict dictionary.
craigpj
Dictionary<string, listDataItem> destDict = new Dictionary<string, listDataItem>();
foreach (listDataItem curr in ListWithDuplicates)
{
    string key = curr.COL1 + curr.COL2;
    if (!destDict.Keys.Contains(key))
        destDict.Add(key, curr);
    else
    {
        if (curr.DATETIME.CompareTo(destDict[key].DATETIME) > 0)
        {
            destDict[key] = curr;
        }
        else
            duplicates++;
    }
    rowcount++;
}
craigpj