views:

287

answers:

3

Hello everyone,

I am using C# + VSTS 2008 + .Net + ASP.Net + IIS 7.0 + ADO.Net + SQL Server 2008. I have a ADO.Net datatable object, and I want to filter out duplicate/similar records (in my specific rule to judge whether records are duplicate/similar -- if record/row has the same value for a string column, I will treat them as duplicate/similar records), and only keep one of such duplicate/similar records.

The output needs to be a datatable, may output the same datatable object if filter operation could be operated on the same datatable object.

What is the most efficient solution?

thanks in advance, George

+1  A: 

You can do a select into with a group by clause, so not duplicates are created. Then drop the old table and rename the table into which you selected to the original table name.

codymanix
Interesting! We can run SQL inside data table? Could you show me a sample please (or a document)? And I especially surprised how to use group by to filter duplicate records?
George2
+2  A: 

Are you using .NET 3.5? If you cast your data rows, you can use LINQ to Objects:

var distinctRows = table.Rows.Cast<DataRow>().Distinct(new E());

...

public class E : IEqualityComparer<DataRow>
{
    bool IEqualityComparer<DataRow>.Equals(DataRow x, DataRow y)
    {
        return x["colA"] == y["colA"];
    }

    int IEqualityComparer<DataRow>.GetHashCode(DataRow obj)
    {
        return obj["colA"].GetHashCode();
    }
}

Or an even simpler way, since you're basing it on a single column's values:

var distinct = from r in table.Rows.Cast<DataRow>()
               group r by (string)r["colA"] into g
               select g.First();

If you need to make a new DataTable out of these distinct rows, you can do this:

var t2 = new DataTable();
t2.Columns.AddRange(table.Columns.Cast<DataColumn>().ToArray());
foreach(var r in distinct)
{
    t2.Rows.Add(r);
}

Or if it would be more handy to work with business objects, you can do an easy conversion:

var persons = (from r in distinct
               select new PersonInfo
               {
                   EmpId = (string)r["colA"],
                   FirstName = (string)r["colB"],
                   LastName = (string)r["colC"],
               }).ToList();

...

public class PersonInfo
{
    public string EmpId {get;set;}
    public string FirstName {get;set;}
    public string LastName {get;set;}
}

Update

Everything you can do in LINQ to Objects can also be done without it: it just takes more code. For example:

var table = new DataTable();
var rowSet = new HashSet<DataRow>(new E());
var newTable = new DataTable();
foreach(DataColumn column in table.Columns)
{
    newTable.Columns.Add(column);
}
foreach(DataRow row in table.Rows)
{
    if(!rowSet.Contains(row))
    {
        rowSet.Add(row);
        newTable.Rows.Add(row);
    }
}

You could also use a similar strategy to simply remove duplicate rows from the original table instead of creating a new table.

StriplingWarrior
Any solution not based on LINQ?
George2
@George2: see my updated answer
StriplingWarrior
Thanks, question answered!
George2
+1  A: 

I would do this in the database layer:

SELECT Distinct...
FROM MyTable

Or if you need aggregates:

SELECT SUM(Field1), ID FROM MyTable
GROUP BY ID

Put the SELECT statement in a stored procedure. Then in .net make a connection to the database, call the stored procedure, execute .ExecuteNonQuery(). Return the rows in a datatable and return the datatable back to your UI.

JonH
Sorry there is no database table related to the datatable, it is an intermediate output from my business logics. We can only operate on the memory datatable. So, your database based solution can not be used. Any solutions?
George2