I have checked the whole site and googled on the net but was unable to find a simple solution to this problem.

I have a DataTable with about 20 columns and 10K rows. I need to remove the duplicate rows in this DataTable based on 4 key columns. Doesn't .NET have a function that does this? The closest function I found was dataTable.DefaultView.ToTable(true, array of columns to display), but this function does a distinct on all of the columns it returns.

It would be great if someone could help me with this.

EDIT: I am sorry for not being clear on this. This datatable is being created by reading a CSV file and not from a DB. So using an SQL query is not an option.

A: 

Use a query instead of functions:

DELETE FROM table1 AS tb1 INNER JOIN 
(SELECT id, COUNT(id) AS cntr FROM table1 GROUP BY id) AS tb2
ON tb1.id = tb2.id WHERE tb2.cntr > 1
Samiksha
+5  A: 

See http://stackoverflow.com/questions/18932/sql-how-can-i-remove-duplicate-rows (adjust the query there to join on your 4 key columns).

EDIT: with your new information I believe the easiest way would be to implement IEqualityComparer<T> and use Distinct on your data rows. Otherwise if you're working with IEnumerable/IList instead of DataTable/DataRow, it is certainly possible with some LINQ-to-objects kung-fu.

EDIT: example IEqualityComparer

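using System;
using System.Collections.Generic;
using System.Data;

public class MyRowComparer : IEqualityComparer<DataRow>
{
    public bool Equals(DataRow x, DataRow y)
    {
        // "ID" and "Name" are example keys; add one && clause per key
        // until all 4 of your key columns are compared.
        return x.Field<int>("ID") == y.Field<int>("ID") &&
            string.Compare(x.Field<string>("Name"), y.Field<string>("Name"), true) == 0;
    }

    public int GetHashCode(DataRow obj)
    {
        // Must agree with Equals: hash Name case-insensitively and tolerate null.
        // XOR in the hash of each remaining key column as well.
        string name = obj.Field<string>("Name") ?? string.Empty;
        return obj.Field<int>("ID").GetHashCode() ^
            StringComparer.CurrentCultureIgnoreCase.GetHashCode(name);
    }
}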

You can use it like this:

var uniqueRows = myTable.AsEnumerable().Distinct(new MyRowComparer());
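
If hard-coding the column names is inconvenient, a more general variant can take the key columns as constructor arguments. This is only a sketch: the class name KeyColumnsComparer is made up for illustration, and unlike the comparer above it compares strings case-sensitively.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Linq;

// Hypothetical helper: treats two rows as equal when every listed key column matches.
public sealed class KeyColumnsComparer : IEqualityComparer<DataRow>
{
    private readonly string[] _columns;

    public KeyColumnsComparer(params string[] columns)
    {
        _columns = columns;
    }

    public bool Equals(DataRow x, DataRow y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x == null || y == null) return false;
        // object.Equals compares boxed values and strings by value.
        return _columns.All(c => object.Equals(x[c], y[c]));
    }

    public int GetHashCode(DataRow row)
    {
        unchecked
        {
            // Combine the hashes of the key columns only, so that rows which
            // are equal under Equals always produce the same hash.
            int hash = 17;
            foreach (string c in _columns)
            {
                object value = row[c];
                hash = hash * 31 + (value == null ? 0 : value.GetHashCode());
            }
            return hash;
        }
    }
}
```

It would be used as myTable.AsEnumerable().Distinct(new KeyColumnsComparer("Key1", "Key2", "Key3", "Key4")), where the four names stand in for your own key columns. Calling CopyToDataTable() on the result gives back a DataTable, provided at least one row remains.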
liggett78
thanks ligett78 for rectifying :)
Samiksha
+1  A: 

If you have access to LINQ, I think you should be able to use the built-in grouping functionality on the in-memory collection and pick out the duplicate rows.

Search Google for "LINQ group by" for examples.
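
As a rough illustration of the grouping idea, the sketch below groups the rows on four key columns and keeps the first row of each group. The class and method names are made up, the key column names are passed in by the caller, and it assumes AsEnumerable and CopyToDataTable are available (System.Data.DataSetExtensions on the .NET Framework).

```csharp
using System;
using System.Data;
using System.Linq;

public static class DataTableDedup
{
    // Hypothetical helper: returns a new table holding the first row seen
    // for each distinct combination of the four key columns.
    public static DataTable DistinctByKeys(
        DataTable table, string k1, string k2, string k3, string k4)
    {
        // CopyToDataTable throws on an empty sequence, so return an
        // empty table with the same schema in that case.
        if (table.Rows.Count == 0)
        {
            return table.Clone();
        }

        return table.AsEnumerable()
            // Anonymous types compare member by member, so rows with the
            // same four key values land in the same group.
            .GroupBy(r => new { A = r[k1], B = r[k2], C = r[k3], D = r[k4] })
            // Keep one representative per group: the first row encountered.
            .Select(g => g.First())
            .CopyToDataTable();
    }
}
```

All 20 columns survive in the result; only the choice of which duplicate to keep (here, the first) is baked in.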

TT
Won't this be overhead when the same thing can be done with a simple single query? No offense, but I would like to know its advantage over a single query.
Samiksha
Read the edit in the question. This is not SQL.
TT
+3  A: 

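You can use LINQ to DataSet. Check this. Something like this (note that DataRowComparer.Default compares the values of every column, so for 4 key columns you would pass your own comparer instead):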

// Fill the DataSet.
DataSet ds = new DataSet();
ds.Locale = CultureInfo.InvariantCulture;
FillDataSet(ds);

List<DataRow> rows = new List<DataRow>();

DataTable contact = ds.Tables["Contact"];

// Get 100 rows from the Contact table.
IEnumerable<DataRow> query = (from c in contact.AsEnumerable()
                              select c).Take(100);

DataTable contactsTableWith100Rows = query.CopyToDataTable();

// Add 100 rows to the list.
foreach (DataRow row in contactsTableWith100Rows.Rows)
    rows.Add(row);

// Create duplicate rows by adding the same 100 rows to the list.
foreach (DataRow row in contactsTableWith100Rows.Rows)
    rows.Add(row);

DataTable table =
    System.Data.DataTableExtensions.CopyToDataTable<DataRow>(rows);

// Find the unique contacts in the table.
IEnumerable<DataRow> uniqueContacts =
    table.AsEnumerable().Distinct(DataRowComparer.Default);

Console.WriteLine("Unique contacts:");
foreach (DataRow uniqueContact in uniqueContacts)
{
    Console.WriteLine(uniqueContact.Field<Int32>("ContactID"));
}
Eduardo Campañó
A: 

Liggett78's answer is much better - esp. as mine had an error! Correction as follows...

DELETE TableWithDuplicates
FROM TableWithDuplicates
LEFT OUTER JOIN (
    SELECT PK_ID = Min(PK_ID), --Decide your method for deciding which rows to keep
           KeyColumn1,
           KeyColumn2,
           KeyColumn3,
           KeyColumn4
    FROM TableWithDuplicates
    GROUP BY KeyColumn1,
             KeyColumn2,
             KeyColumn3,
             KeyColumn4
) AS RowsToKeep
    ON TableWithDuplicates.PK_ID = RowsToKeep.PK_ID
WHERE RowsToKeep.PK_ID IS NULL
A: 

Found this on bytes.com:

You can use the JET 4.0 OLE DB provider with the classes in the System.Data.OleDb namespace to access the comma delimited text file (using a DataSet/DataTable).

Or you could use Microsoft Text Driver for ODBC with the classes in the System.Data.Odbc namespace to access the file using ODBC drivers.

That would allow you to access your data via SQL queries, as others proposed.

Treb
A: 

"This datatable is being created by reading a CSV file and not from a DB."

So put a unique constraint on the four columns in the database, and inserts that are duplicates under your design won't go in. Unless it decides to fail instead of continuing when this happens, but this surely is configurable in your CSV import script.

JeeBee
I am not inserting the data into a database. This data is being written to a CSV file.
Khaja Minhajuddin
A: 

This is very simple code that requires neither LINQ nor naming individual columns to do the filtering. Any row in which every column value is null or empty is removed.


public DataSet duplicateRemoval(DataSet dSet)
{
    DataTable source = dSet.Tables[0];
    int ccount = source.Columns.Count;
    string[] colst = new string[ccount];

    DataSet dsTemp = new DataSet();
    dsTemp.Tables.Add(new DataTable());

    // Copy the column names; every value is stored as a string.
    for (int i = 0; i < ccount; i++)
    {
        dsTemp.Tables[0].Columns.Add(source.Columns[i].ColumnName, typeof(string));
    }

    foreach (DataRow row in source.Rows)
    {
        bool flag = false; // becomes true once any column holds data
        int p = 0;
        foreach (DataColumn col in source.Columns)
        {
            colst[p++] = row[col].ToString();
            if (!string.IsNullOrEmpty(row[col].ToString()))
            {
                flag = true;
            }
        }

        // Keep the row only if at least one column has data.
        if (flag)
        {
            DataRow myRow = dsTemp.Tables[0].NewRow();
            for (int kk = 0; kk < ccount; kk++)
            {
                myRow[kk] = colst[kk];
            }
            dsTemp.Tables[0].Rows.Add(myRow);
        }
    }
    return dsTemp;
}

This can even be used to remove null data from an Excel sheet.
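
The method above filters out empty rows rather than duplicates. In the same no-LINQ spirit, duplicates on four key columns could be dropped along these lines. This is a sketch only: the class and method names are made up, the key column names are supplied by the caller, and the first occurrence of each key is the one kept.

```csharp
using System;
using System.Collections.Generic;
using System.Data;

public static class DataTableDedupInPlace
{
    // Hypothetical helper: removes from `table` every row whose four key
    // values were already seen on an earlier row. Returns the number removed.
    public static int RemoveDuplicateRows(
        DataTable table, string k1, string k2, string k3, string k4)
    {
        // Tuples compare component by component, so they work as set keys.
        var seen = new HashSet<Tuple<object, object, object, object>>();
        var duplicates = new List<DataRow>();

        foreach (DataRow row in table.Rows)
        {
            var key = Tuple.Create(row[k1], row[k2], row[k3], row[k4]);
            // HashSet.Add returns false when the key is already present.
            if (!seen.Add(key))
            {
                duplicates.Add(row);
            }
        }

        // Remove after the loop; the Rows collection cannot be modified
        // while it is being enumerated.
        foreach (DataRow row in duplicates)
        {
            table.Rows.Remove(row);
        }
        return duplicates.Count;
    }
}
```

Each row is visited once, so for a table of about 10K rows the cost should be negligible.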

Srikanth V M