Currently we have a large DataTable (~152k rows) and are iterating over it with a For Each loop to find the subset of distinct entries (~124k rows). This currently takes about 14 minutes, which is far too long.

As we are stuck on .NET 2.0 (our reporting won't work with VS 2008+) I can't use LINQ, though in fairness I don't know whether it would be any faster.

Is there a better way to find the distinct values (invoice numbers, in this case) than this For Each loop?

This is the code:

Public Shared Function SelectDistinctList(ByVal SourceTable As DataTable, _
                                          ByVal FieldName As String) As List(Of String)
    Dim list As New List(Of String)
    For Each row As DataRow In SourceTable.Rows
        Dim value As String = CStr(row(FieldName))
        If Not list.Contains(value) Then
            list.Add(value)
        End If
    Next
    Return list
End Function
A:

Using a Dictionary rather than a List will be quicker:

    Dim seen As New Dictionary(Of String, String)
    ...
        If Not seen.ContainsKey(value) Then
            seen.Add(value, "")
        End If

When you search a List, Contains does a linear scan, comparing each entry with value; by the end of the process each lookup is comparing against ~124k existing entries, so the total work grows with the square of the number of distinct values. A Dictionary, on the other hand, uses hashing, so each lookup takes roughly constant time regardless of how many entries it already holds.

When you want to return the list of unique values, use seen.Keys.

(Note that you'd ideally use a Set type for this, but .NET 2.0 doesn't have one.)
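Putting that together, the original function rewritten around a Dictionary might look something like this (a sketch with the same signature as the question's code, not tested against the asker's data):

    ' Sketch: same signature as the original SelectDistinctList, but using
    ' a Dictionary as a makeshift set. ContainsKey is a hash lookup, so each
    ' check is roughly O(1) instead of a linear scan of the list so far.
    Public Shared Function SelectDistinctList(ByVal SourceTable As DataTable, _
                                              ByVal FieldName As String) As List(Of String)
        Dim seen As New Dictionary(Of String, String)
        For Each row As DataRow In SourceTable.Rows
            Dim value As String = CStr(row(FieldName))
            If Not seen.ContainsKey(value) Then
                seen.Add(value, "")   ' the value half of the pair is unused
            End If
        Next
        ' Dictionary.Keys is a collection, so copy it into the List the
        ' callers expect.
        Return New List(Of String)(seen.Keys)
    End Function

The values in seen.Keys come out in whatever order the hash table stores them; if callers rely on first-seen order, append each new value to a separate List(Of String) as you add it to the dictionary and return that instead.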

RichieHindle
That seems to have got it down to under 1 second; looks like that was one massive bottleneck!
themaninthesuitcase