I need to develop an application that compares two CSV files. The first file has a list of email addresses. The second file also has email addresses, but includes name and address info. The first file contains the email addresses that need to be removed from the second file. I have the Fast CSV reader from the CodeProject site, which works pretty well. The application will not have access to a database server. A new file will be generated with the data that is considered verified, meaning it will not contain any of the records whose email address appears in the first file.

+2  A: 

If you read both lists into collections, you can use LINQ to select only the records whose email address is not in the removal list.

Here is a quick example class I whipped up for you.

using System;
using System.Linq;
using System.Collections.Generic;

public class RemoveExample
{
    public List<Item> RemoveAddresses(List<Item> sourceList, List<string> emailAddressesToRemove)
    {
        List<Item> newList = (from s in sourceList
                              where !emailAddressesToRemove.Contains(s.Email)
                              select s).ToList();
        return newList;
    }

    public class Item
    {
        public string Email { get; set; }
        public string Name { get; set; }
        public string Address { get; set; }
    }
}

To use it, read your CSV into a List<Item>, then pass it, along with your list of addresses to remove as a List<string>, into the method.
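For illustration, here is a minimal sketch of such a call. The sample records, the example.com addresses, and the RemoveUsageSketch and Demo names are made up for this example; in a real application the two lists would come from whatever CSV reader you use. Note that List<string>.Contains compares strings case-sensitively, so "Bob@example.com" and "bob@example.com" would be treated as different addresses.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Same shape as the Item class in the answer above.
public class Item
{
    public string Email { get; set; }
    public string Name { get; set; }
    public string Address { get; set; }
}

public static class RemoveUsageSketch
{
    // Same filter as RemoveAddresses in the answer above, written in method syntax.
    public static List<Item> RemoveAddresses(List<Item> sourceList, List<string> emailAddressesToRemove)
    {
        return sourceList
            .Where(s => !emailAddressesToRemove.Contains(s.Email))
            .ToList();
    }

    // Hypothetical sample data standing in for the two parsed CSV files.
    public static List<Item> Demo()
    {
        var sourceList = new List<Item>
        {
            new Item { Email = "alice@example.com", Name = "Alice", Address = "1 Main St" },
            new Item { Email = "bob@example.com",   Name = "Bob",   Address = "2 Oak Ave" },
            new Item { Email = "carol@example.com", Name = "Carol", Address = "3 Elm Rd" }
        };
        var emailAddressesToRemove = new List<string> { "bob@example.com" };

        // Keeps Alice and Carol; Bob's record is dropped.
        return RemoveAddresses(sourceList, emailAddressesToRemove);
    }
}
```

The list returned by Demo() is what you would then write out as the new, verified file.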

Ricky Smith
Exactly what I was looking for. Also, if anyone is interested I found this cool LINQ to CSV library: http://www.codeproject.com/KB/linq/LINQtoCSV.aspx
DDiVita
+1  A: 

Not sure what kind of advice you need; it sounds straightforward.

Here's a quick algorithm sketch:

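  • loop through the emails from the first csv
    • put each email in a HashSet<>
  • run your delete
  • put each output email in the same HashSet<>
    • if HashSet<>.Add returns false (the email is already in the set; no exception is thrown), you missed one in the delete
    • if emailList2.Count - emailList1.Count != outputList.Count, you deleted too many (this assumes every email in the first list appears exactly once in the second)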
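The "did the delete miss anything" check above could be sketched roughly as follows. The RemovalCheckSketch and FindMissedRemovals names are hypothetical, and the case-insensitive comparer is an assumption about how addresses should be matched. Because it adds every output email to the same set, an address that legitimately appears twice in the output is also reported.

```csharp
using System;
using System.Collections.Generic;

public static class RemovalCheckSketch
{
    // Returns every output email that collides with an email already in the set,
    // i.e. one that should have been removed (or that is duplicated in the output).
    public static List<string> FindMissedRemovals(
        IEnumerable<string> emailsToRemove,
        IEnumerable<string> outputEmails)
    {
        // Assumption: addresses are compared case-insensitively.
        var seen = new HashSet<string>(emailsToRemove, StringComparer.OrdinalIgnoreCase);
        var missed = new List<string>();

        foreach (var email in outputEmails)
        {
            // HashSet<T>.Add returns false, rather than throwing,
            // when the value is already present.
            if (!seen.Add(email))
            {
                missed.Add(email);
            }
        }

        return missed;
    }
}
```

An empty result means no address from the first file survived into the output.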
BioBuckyBall
A: 

This is relatively simple, assuming the lists aren't terribly large or memory usage isn't an overly large concern: read both sets of email addresses into two separate HashSet<string> instances. Then you can use HashSet<T>.ExceptWith to find the difference between the two sets. For instance:

HashSet<string> setA = ...;
HashSet<string> setB = ...;

setA.ExceptWith(setB); // Remove all strings in setB from setA

// Print all strings that were in setA, but not setB
foreach(var s in setA)
   System.Console.WriteLine(s);

BTW, the above should be roughly O(n) on average, since hash set lookups are O(1) on average, versus the LINQ answer, which is O(n*m) because List<string>.Contains scans the whole removal list for every record.
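Since the second file carries name and address data as well, one way to combine the two ideas is to keep the records in a list and use a HashSet<string> only for the lookup. This is a sketch under assumptions: the Record class, the HashSetFilterSketch and Filter names, and the case-insensitive comparer are all illustrative, not part of any library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical record type for a row of the second CSV file.
public class Record
{
    public string Email { get; set; }
    public string Name { get; set; }
    public string Address { get; set; }
}

public static class HashSetFilterSketch
{
    // Keeps only the records whose email is not in the removal list.
    public static List<Record> Filter(
        IEnumerable<Record> records,
        IEnumerable<string> emailsToRemove)
    {
        // Build the lookup once. Assumption: case-insensitive matching is wanted.
        var blocked = new HashSet<string>(emailsToRemove, StringComparer.OrdinalIgnoreCase);

        // Each Contains call is O(1) on average, so the whole pass is
        // linear in the number of records.
        return records.Where(r => !blocked.Contains(r.Email)).ToList();
    }
}
```

Unlike ExceptWith on two sets of strings, this keeps the full records (and any duplicates among them) in their original order.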

Nathan Ernst