views:

62

answers:

3

I'm trying to LINQ two tables based on a dynamic key. User can change key via a combo box. Key may be money, string, double, int, etc. Currently I'm getting the data just fine, but without filtering out the doubles. I can filter the double in VB, but it's slooooow. I'd like to do it in the LINQ query right out of the gate.

Here's the data:

First Table:

 -------------------------------------------------------------
| AppleIndex  | AppleCost  | AppleColor  | AppleDescription   |
 ------------------------------------------------------------
|     1       |     3      | Red         | This is an apple   |
|     2       |     5      | Green       | This is an apple   |
|     3       |     4      | Pink        | This is an apple   |
|     4       |     2      | Yellow      | This is an apple   |
|     5       |     2      | Orange      | This is an apple   |
|     1       |     3      | Red         | This is a duplicate|
|     2       |     5      | Green       | This is a duplicate|
|     3       |     4      | Pink        | This is a duplicate|
|     4       |     2      | Yellow      | This is a duplicate|
|     5       |     2      | Orange      | This is a duplicate|
 -------------------------------------------------------------

Second Table:

 ------------------------------------------------------------
| OrangeIndex | OrangeCost | OrangeColor | OrangeDescription |
 ------------------------------------------------------------
|     1       |     1      | Orange      | This is an Orange |
|     2       |     3      | Orange      |                   |
|     3       |     2      | Orange      | This is an Orange |
|     4       |     3      | Orange      |                   |
|     5       |     2      | Orange      | This is an Orange |
 ------------------------------------------------------------

Currently, I'm using the following code to get too much data:

Dim Matches = From mRows In LinqMasterTable Join sRows In LinqSecondTable _
              On mRows(ThePrimaryKey) Equals sRows(TheForignKey) _
              Order By mRows(ThePrimaryKey) _
              Select mRows, sRows Distinct

Outcome:

 -------------------------------------------------------------------------
| 1  | 3 | Red    | This is an apple     | 1 | Orange | This is an Orange |
| 1  | 3 | Red    | This is an duplicate | 1 | Orange | This is an Orange |
| 2  | 5 | Green  | This is an apple     | 3 | Orange |                   |
| 2  | 5 | Green  | This is an duplicate | 3 | Orange |                   |
| 3  | 4 | Pink   | This is an apple     | 2 | Orange | This is an Orange |
| 3  | 4 | Pink   | This is an duplicate | 2 | Orange | This is an Orange |
| 4  | 2 | Yellow | This is an apple     | 3 | Orange |                   |
| 4  | 2 | Yellow | This is an duplicate | 3 | Orange |                   |
| 5  | 2 | Orange | This is an apple     | 2 | Orange | This is an Orange |
| 5  | 2 | Orange | This is an duplicate | 2 | Orange | This is an Orange |
 -------------------------------------------------------------------------

Desired Outcome:

 ------------------------------------------------------------------------
| 1 | 3 | Red    | This is an apple | 1 | 1 | Orange | This is an Orange |
| 2 | 5 | Green  | This is an apple | 2 | 3 | Orange |                   |
| 3 | 4 | Pink   | This is an apple | 3 | 2 | Orange | This is an Orange |
| 4 | 2 | Yellow | This is an apple | 4 | 3 | Orange |                   |
| 5 | 2 | Orange | This is an apple | 5 | 2 | Orange | This is an Orange |
 ------------------------------------------------------------------------

I have tried the following:

'Get the original Column Names into an Array List
'MasterTableColumns = GetColumns(qMasterDS, TheMasterTable) '(external code)

'Plug the Existing DataSet into a DataView:
Dim View As DataView = New DataView(qMasterTable)

'Sort by the Primary Key:
View.Sort = ThePrimaryKey

'Build a new table listing only one column:
Dim newListTable As DataTable = _
View.ToTable("UniqueData", True, ThePrimaryKey)

This returns a unique list, but no associated data:

 -------------
| AppleIndex  |
 -------------
|     1       | 
|     2       | 
|     3       |
|     4       |
|     5       |
 -------------

So I tried this instead:

'Build a new table with ALL the columns:
Dim newFullTable As DataTable = _
View.ToTable("UniqueData", True, _
     MasterTableColumns(0), _
     MasterTableColumns(1), _
     MasterTableColumns(2), _
     MasterTableColumns(3))

Unfortunately, it yields the following... with duplicates:

 -------------------------------------------------------------
| AppleIndex  | AppleCost  | AppleColor  | AppleDescription   |
 ------------------------------------------------------------
|     1       |     3      | Red         | This is an apple   |
|     2       |     5      | Green       | This is an apple   |
|     3       |     4      | Pink        | This is an apple   |
|     4       |     2      | Yellow      | This is an apple   |
|     5       |     2      | Orange      | This is an apple   |
|     1       |     3      | Red         | This is a duplicate|
|     2       |     5      | Green       | This is a duplicate|
|     3       |     4      | Pink        | This is a duplicate|
|     4       |     2      | Yellow      | This is a duplicate|
|     5       |     2      | Orange      | This is a duplicate|
 -------------------------------------------------------------

Any ideas?

~~~~~~~~~~~~ Update: ~~~~~~~~~~~~

Jeff M suggested the following code. (Thanks Jeff) However, it gives me a error. Does anyone know the syntax for making this work in VB? I've monkeyed with it a bit and can't seem to get it right.

Dim matches = _
    From mRows In (From row In LinqMasterTable _
        Group row By row(ThePrimaryKey) Into g() _
        Select g.First()) _
    Join sRows In LinqSecondTable _
    On mRows(ThePrimaryKey) Equals sRows(TheForignKey) _
    Order By mRows(ThePrimaryKey) _
    Select mRows, sRows

Error in Third row at "row(ThePrimaryKey)":

"Range variable name can be inferred only from a simple or qualified name with no arguments."

A: 

Edit:
Here's how I would write the C# LINQ query. Here's an alternate version rather than using Distinct(), uses a nested query with grouping which should have similar semantics. It should be easily convertible to VB.

var matches = from mRows in (from row in LinqMasterTable
                             group row by row[ThePrimaryKey] into g
                             select g.First())
              join sRows in LinqSecondTable
                  on mRows[ThePrimaryKey] Equals sRows[TheForignKey]
              orderby mRows[ThePrimaryKey]
              select new { mRows, sRows }

and my attempt at a VB version of the above:

Edit:
As for the most recent error, I know exactly how to deal with it. When I was playing with VB LINQ, I found that the compiler doesn't like complex grouping expressions. To get around that, assign row(ThePrimaryKey) to a temporary variable and group by that variable. It should work then.

Dim matches = From mRows In (From row In LinqMasterTable _
                             Let grouping = row(ThePrimaryKey)
                             Group row By grouping Into g() _
                             Select g.First()) _
              Join sRows In LinqSecondTable _
                  On mRows(ThePrimaryKey) Equals sRows(TheForignKey) _
              Order By mRows(ThePrimaryKey) _
              Select mRows, sRows

Actually upon second inspection, it turns out that what is being grouped by needs a name. The following will work.

Dim matches = From mRows In (From row In LinqMasterTable _
                             Group row By Grouping = row(ThePrimaryKey) Into g() _
                             Select g.First()) _
              Join sRows In LinqSecondTable _
                  On mRows(ThePrimaryKey) Equals sRows(TheForignKey) _
              Order By mRows(ThePrimaryKey) _
              Select mRows, sRows
Jeff M
The problem is that all the row in LinqMasterTable *are* distinct already.
James Curran
True, but I had intended, distinct by primary key. I just don't know how to express that in VB.
Jeff M
I assume you meant "given key", since primary keys are distinct by nature. And I don't think that's even expressable in SQL : "Group by given key, and select one row at random"?
James Curran
I left it using the OP's original code. If it was a true primary key, this duplication in the first table wouldn't be a problem in the first place.
Jeff M
So, the purpose of this prog is to allow the user to merge the data in two files on the selected "index" in ComboBox1 and CombpBox2. They want the option to remove duplicates. They don't care which item is kept, as long as it's only one item. Ergo, it matters not if it's "This is an Apple" or "This is a duplicate", as long as they don't get both back. I realize that I could omit the "Description" from the results or the query, but once the program is in the users hands, I'll have no control over what they put into it. Short of omitting everything but the key, I can't guarantee unique results.
Lord Baddkitty
Cool. I'll take a few more cracks at translating the C# to vb. the current example gives me the following error on "row(ThePrimaryKey)": "Range variable name can be inferred only from a simple or qualified name with no arguments."
Lord Baddkitty
Nothing doing. I don't understand why it wouldn't work as suggested above. I'm updating the problem with the proposed solution above. It's a good solution, I just can't make it work in VB.
Lord Baddkitty
Error: Definition of method 'g' is not accessible in this context.
Lord Baddkitty
I got it. Thanks for the input.
Lord Baddkitty
+1  A: 

Well, the basic problem isn't the LINQ. It's the fact the your First Table contains "duplicates", which aren't really duplicates, since in your example, every row is distinctive.

So, our question to you is "How do we identify the duplicates in the original table?". Once that is answered, the rest should be trivial.

For example (In C# since I'm not sure of the VB syntax)

var Matches = from mRows in LinqMasterTable
                             .Where(r=>r.AppleDescription=="This is an Apple")
              join sRows in LinqSecondTable 
                   on mRows(ThePrimaryKey) equals sRows(TheForignKey)  
              orderby mRows(ThePrimaryKey) 
              select new { mRows, sRows};
James Curran
Agreed. Not a true Primary Key, that's just the name of the string, used as a mnemonic to help keep track of what I'm doing while coding.
Lord Baddkitty
A: 

Declarations and Such:

Private Sub LinqTwoTableInnerJoin(ByRef qMasterDS As DataSet, _
                                  ByRef qMasterTable As DataTable, _
                                  ByRef qSecondDS As DataSet, _
                                  ByRef qSecondTable As DataTable, _
                                  ByRef qPrimaryKey As String, _
                                  ByRef qForignKey As String, _
                                  ByVal qResultsName As String)

Dim TheMasterTable As String = qMasterTable.TableName
Dim TheSecondTable As String = qSecondTable.TableName
Dim ThePrimaryKey As String = qPrimaryKey
Dim TheForignKey As String = qForignKey
Dim TheNewForignKey As String = ""

MasterTableColumns = GetColumns(qMasterDS, TheMasterTable)
SecondTableColumns = GetColumns(qSecondDS, TheSecondTable)

Dim mColumnCount As Integer = MasterTableColumns.Count
Dim sColumnCount As Integer = SecondTableColumns.Count

Dim ColumnCount As Integer = mColumnCount + sColumnCount

Dim LinqMasterTable = qMasterDS.Tables(TheMasterTable).AsEnumerable
Dim LinqSecondTable = qSecondDS.Tables(TheSecondTable).AsEnumerable

Get the Data and order it by the Selected Key:

Dim Matches = From mRows In LinqMasterTable Join sRows In LinqSecondTable _
             On mRows(ThePrimaryKey) Equals sRows(TheForignKey) _
             Order By mRows(ThePrimaryKey) _
             Select mRows, sRows

Put the Results into a Dataset Table:

' Make sure the dataset is available and/or cleared:
If dsResults.Tables(qResultsName) Is Nothing Then dsResults.Tables.Add(qResultsName)
dsResults.Tables(qResultsName).Clear() : dsResults.Tables(qResultsName).Columns.Clear()

'Adds Master Table Column Names
For x = 0 To MasterTableColumns.Count - 1
    dsResults.Tables(qResultsName).Columns.Add(MasterTableColumns(x))
Next

'Rename Second Table Names if Needed:
For x = 0 To SecondTableColumns.Count - 1
    With dsResults.Tables(qResultsName)
        For y = 0 To .Columns.Count - 1
            If SecondTableColumns(x) = .Columns(y).ColumnName Then
                SecondTableColumns(x) = SecondTableColumns(x) & "_2"
            End If
        Next
    End With
Next

'Make sure that the Forign Key is a Unique Value
If ForignKey1 = PrimaryKey Then
    TheNewForignKey = ForignKey1 & "_2"
Else
    TheNewForignKey = ForignKey1
End If

'Adds Second Table Column Names
For x = 0 To SecondTableColumns.Count - 1 
    dsResults.Tables(qResultsName).Columns.Add(SecondTableColumns(x))
Next

'Copy Results into the Dataset:
For Each Match In Matches

    'Build an array for each row:
    Dim NewRow(ColumnCount - 1) As Object

    'Add the mRow Items:
    For x = 0 To MasterTableColumns.Count - 1
        NewRow(x) = Match.mRows.Item(x)
    Next

    'Add the srow Items:
    For x = 0 To SecondTableColumns.Count - 1
        Dim y As Integer = x + (MasterTableColumns.Count)
        NewRow(y) = Match.sRows.Item(x)
    Next

    'Add the array to dsResults as a Row:
    dsResults.Tables(qResultsName).Rows.Add(NewRow)

Next

Give the user an option to clean doubles or not:

If chkUnique.Checked = True Then
    ReMoveDuplicates(dsResults.Tables(qResultsName), ThePrimaryKey)
End If

Remove the Duplicates if they so desire:

Private Sub ReMoveDuplicates(ByRef SkipTable As DataTable, _
                         ByRef TableKey As String)

    'Make sure that there's data to work with:
    If SkipTable Is Nothing Then Exit Sub
    If TableKey Is Nothing Then Exit Sub

    'Create an ArrayList of rows to delete:
    Dim DeleteRows As New ArrayList()

    'Fill the Array with Row Number of the items equal 
    'to the item above them:
    For x = 1 To SkipTable.Rows.Count - 1
        Dim RowOne As DataRow = SkipTable.Rows(x - 1)
        Dim RowTwo As DataRow = SkipTable.Rows(x)
        If RowTwo.Item(TableKey) = RowOne.Item(TableKey) Then
            DeleteRows.Add(x)
        End If
    Next

    'If there are no hits, exit this sub:
    If DeleteRows.Count < 1 Or DeleteRows Is Nothing Then
        Exit Sub
    End If

    'Otherwise, remove the rows based on the row count value:
    For x = 0 To DeleteRows.Count - 1

        'Start at the END and count backwards so the duplicate 
        'item's row count value doesn't change with each deleted row
        Dim KillRow As Integer = DeleteRows((DeleteRows.Count - 1) - x)

        'Delete the row:
        SkipTable.Rows(KillRow).Delete()

    Next
End Sub

Then clean up any leftovers:

If Not chkRetainKeys.Checked = True Then 'Removes Forign Key
    dsResults.Tables(qResultsName).Columns.Remove(TheNewForignKey)
End If

'Clear Arrays
MasterTableColumns.Clear()
SecondTableColumns.Clear()

Final Analysis: Ran this against 2 Files with 4 columns, 65,535 rows, and with some doubles. Process time, roughly 1 second. In fact it took longer to load the fields into memory than it did to parse the data.

Lord Baddkitty