tags:

views:

18654

answers:

8

Hi all,

OK so I have two datatables A and B. They are both produced from CSV files. I need to be able to check which rows exist in B that do not exist in A.

Is there a way to do some sort of query to show the different rows or would I have to iterate through each row on each DataTable to check if they are the same. The later option seems to be very intensive if the tables become large.

Thanks for the help.

+3  A: 

You can use the Merge and GetChanges methods on the DataTable to do this:

A.Merge(B); // this will add to A any records that are in B but not A
return A.GetChanges(); // returns records originally only in B
MusiGenesis
I've been trying this and getting a null result, even though A and B were different (A was empty, B had data)
Chad
@Chad: In your case, A may be empty of both data *and* columns/fields. A and B have to have the same columns for this method to work.
MusiGenesis
A: 

Just FYI:

Generally speaking about algorithms, comparing two sets of sortable (as ids typically are) is not an O(M*N/2) operation, but O(M+N) if the two sets are ordered. So you scan one table with a pointer to the start of the other, and:

other_item= A.first()
only_in_B= empty_list()
for item in B:
    while other_item > item:
        other_item= A.next()
        if A.eof():
             only_in_B.add( all the remaining B items)
             return only_in_B
    if item < other_item:
         empty_list.append(item)
return only_in_B

The code above is obviously pseudocode, but should give you the general gist if you decide to code it yourself.

ΤΖΩΤΖΙΟΥ
+3  A: 

would I have to iterate through each row on each DataTable to check if they are the same.

Seeing as you've loaded the data from a CSV file, you're not going to have any indexes or anything, so at some point, something is going to have to iterate through every row, whether it be your code, or a library, or whatever.

Anyway, this is an algorithms question, which is not my specialty, but my naive approach would be as follows:

1: Can you exploit any properties of the data? Are all the rows in each table unique, and can you sort them both by the same criteria? If so, you can do this:

  • Sort both tables by their ID (using some useful thing like a quicksort). If they're already sorted then you win big.
  • Step through both tables at once, skipping over any gaps in ID's in either table. Matched ID's mean duplicated records.

This allows you to do it in (sort time * 2 ) + one pass, so if my big-O-notation is correct, it'd be (whatever-sort-time) + O(m+n) which is pretty good.
(Revision: this is the approach that ΤΖΩΤΖΙΟΥ describes )

2: An alternative approach, which may be more or less efficient depending on how big your data is:

  • Run through table 1, and for each row, stick it's ID (or computed hashcode, or some other unique ID for that row) into a dictionary (or hashtable if you prefer to call it that).
  • Run through table 2, and for each row, see if the ID (or hashcode etc) is present in the dictionary. You're exploiting the fact that dictionaries have really fast - O(1) I think? lookup. This step will be really fast, but you'll have paid the price doing all those dictionary inserts.

I'd be really interested to see what people with better knowledge of algorithms than myself come up with for this one :-)

Orion Edwards
+4  A: 

Assuming you have an ID column which is of an appropriate type (i.e. gives a hashcode and implements equality) - string in this example, which is slightly pseudocode because I'm not that familiar with DataTables and don't have time to look it all up just now :)

IEnumerable<string> idsInA = tableA.AsEnumerable().Select(row => (string)row["ID"]);
IEnumerable<string> idsInB = tableB.AsEnumerable().Select(row => (string)row["ID"]);
IEnumerable<string> bNotA = idsInB.Except(idsInA);
Jon Skeet
+3  A: 

The answers so far assume that you're simply looking for duplicate primary keys. That's a pretty easy problem - you can use the Merge() method, for instance.

But I understand your question to mean that you're looking for duplicate DataRows. (From your description of the problem, with both tables being imported from CSV files, I'd even assume that the original rows didn't have primary key values, and that any primary keys are being assigned via AutoNumber during the import.)

The naive implementation (for each row in A, compare its ItemArray with that of each row in B) is indeed going to be computationally expensive.

A much less expensive way to do this is with a hashing algorithm. For each DataRow, concatenate the string values of its columns into a single string, and then call GetHashCode() on that string to get an int value. Create a Dictionary<int, DataRow> that contains an entry, keyed on the hash code, for each DataRow in DataTable B. Then, for each DataRow in DataTable A, calculate the hash code, and see if it's contained in the dictionary. If it's not, you know that the DataRow doesn't exist in DataTable B.

This approach has two weaknesses that both emerge from the fact that two strings can be unequal but produce the same hash code. If you find a row in A whose hash is in the dictionary, you then need to check the DataRow in the dictionary to verify that the two rows are really equal.

The second weakness is more serious: it's unlikely, but possible, that two different DataRows in B could hash to the same key value. For this reason, the dictionary should really be a Dictionary<int, List<DataRow>>, and you should perform the check described in the previous paragraph against each DataRow in the list.

It takes a fair amount of work to get this working, but it's an O(m+n) algorithm, which I think is going to be as good as it gets.

Robert Rossney
A: 

Hi guys,

Thanks for all the feedback.

I do not have any index's unfortunately. I will give a little more information about my situation.

We have a reporting program (replaced Crystal reports) that is installed in 7 Servers across EU. These servers have many reports on them (not all the same for each country). They are invoked by a commandline application that uses XML files for their configuration. So One XML file can call multiple reports.

The commandline application is scheduled and controlled by our overnight process. So the XML file could be called from multiple places.

The goal of the CSV is to produce a list of all the reports that are being used and where they are being called from.

I am going through the XML files for all references, querying the scheduling program and producing a list of all the reports. (this is not too bad).

The problem I have is I have to keep a list of all the reports that might have been removed from production. So I need to compare the old CSV with the new data. For this I thought it best to put it into DataTables and compare the information, (this could be the wrong approach. I suppose I could create an object that holds it and compares the difference then create iterate through them).

The data I have about each report is as follows:

String - Task Name String - Action Name Int - ActionID (the Action ID can be in multiple records as a single action can call many reports, i.e. an XML file). String - XML File called String - Report Name

I will try the Merge idea given by MusiGenesis (thanks). (rereading some of the posts not sure if the Merge will work, but worth trying as I have not heard about it before so something new to learn).

The HashCode Idea sounds interesting as well.

Thanks for all the advice.

Jon
Using Merge and GetChanges will work (I just tested it). It's not the most efficient solution, but unless your CSV files are huge it won't be a problem.
MusiGenesis
Thank you Musi... I would have couple of hundred rows at a time so Merge sounds good.
Jon
A: 

public DataTable compareDataTables(DataTable First, DataTable Second) { First.TableName = "FirstTable"; Second.TableName = "SecondTable";

        //Create Empty Table
        DataTable table = new DataTable("Difference");
        DataTable table1 = new DataTable();
        try
        {
            //Must use a Dataset to make use of a DataRelation object
            using (DataSet ds4 = new DataSet())
            {
                //Add tables
                ds4.Tables.AddRange(new DataTable[] { First.Copy(), Second.Copy() });

                //Get Columns for DataRelation
                DataColumn[] firstcolumns = new DataColumn[ds4.Tables[0].Columns.Count];

                for (int i = 0; i < firstcolumns.Length; i++)
                {
                    firstcolumns[i] = ds4.Tables[0].Columns[i];
                }

                DataColumn[] secondcolumns = new DataColumn[ds4.Tables[1].Columns.Count];

                for (int i = 0; i < secondcolumns.Length; i++)
                {
                    secondcolumns[i] = ds4.Tables[1].Columns[i];
                }

                //Create DataRelation
                DataRelation r = new DataRelation(string.Empty, firstcolumns, secondcolumns, false);

                ds4.Relations.Add(r);

                //Create columns for return table
                for (int i = 0; i < First.Columns.Count; i++)
                {
                    table.Columns.Add(First.Columns[i].ColumnName, First.Columns[i].DataType);
                }

                //If First Row not in Second, Add to return table.
                table.BeginLoadData();

                foreach (DataRow parentrow in ds4.Tables[0].Rows)
                { 
                    DataRow[] childrows = parentrow.GetChildRows(r);

                    if (childrows == null || childrows.Length == 0)
                        table.LoadDataRow(parentrow.ItemArray, true);
                    table1.LoadDataRow(childrows, false);

                }

                table.EndLoadData();

            }
        }
        catch (Exception ex)
        {
            Console.WriteLine(ex.Message);
        }

        return table;
    }
A: 
        try
        {
            if (ds.Tables[0].Columns.Count == ds1.Tables[0].Columns.Count)
            {
               for (int i = 0; i < ds.Tables[0].Rows.Count; i++)
               {
                    for (int j = 0; j < ds.Tables[0].Columns.Count; j++)
                   {
       if (ds.Tables[0].Rows[i][j].ToString() == ds1.Tables[0].Rows[i][j].ToString())
                       {


                        }
                        else
                        {

                           MessageBox.Show(i.ToString() + "," + j.ToString());


                       }

                                               }

                }

            }
            else
            {
               MessageBox.Show("Table has different columns ");
            }
        }
        catch (Exception)
        {
           MessageBox.Show("Please select The Table");
        }