views:

207

answers:

7

I have got 3 text files (A, B and C), each with several hundred email addresses. I want to merge list A and list B into a single file, ignoring differences in case and white space. Then I want to remove all emails in the new list that are in list C, again ignoring differences in case and white space.

My programming language of choice is normally C++, but it seems poorly suited for this task. Is there a scripting language that could do this (and similar tasks) in relatively few lines?

Or is there software already out there (free or commercial) that would allow me to do it? Is it possible to do it in Excel, for example?

+3  A: 

The fastest way to do this probably wouldn't necessarily require coding. You could import files A and B into Excel in one worksheet, then (if necessary) do a filter on that resulting list of addresses to remove any duplicates.

The next step would be to import file C into a second worksheet. In a third worksheet, you'd do a VLOOKUP to pick out all the addresses in your first list and remove them if they're in your "List C".

The VLOOKUP would look something like this:

=IF(ISNA(VLOOKUP(email_address_cell, Sheet2!email_duplicates_list, 1, false), "", (VLOOKUP(*email_address_cell*, Sheet2!*email_duplicates_list*, 1, false)))

I've also included a check to see if the formula returns a "Value Not Available" error, in which case the cell just shows a blank value. From there, you just need to remove your white-space and there's your final list.

Now having said all that, you could still do a VBA macro to do much the same thing, but perhaps clean the lists up a bit, depending on what you need. Hope that helps!

Phil.Wheeler
How does this address the whitespace and case requirement.
paxdiablo
What requirement? We want to ignore case differences (which Excel will do by default) and whitespace. Admittedly, you'll wind up with some blank lines using the above suggestion, but they can be easily removed using another autofilter.
Phil.Wheeler
+2  A: 

For text processing of the sort you describe, either perl or python are ideal.

You can use associative arrays (arrays with a string index in this case) to store the email addresses in a list.

Use the lowercased, un-whitespaced email address as a key and the real email address as the value.

Then it's a matter of reading in and storing the first file, reading in and storing the second (which will overwrite email addresses with the same key), then reading in the third file and deleting entries from the list with that key.

What you're then left with is the list you desire (A + B - C).

Pseudo-code here:

set list to empty
foreach line in file one:
    key = unwhitespace(tolowercase(line))
    list{key} = line
foreach line in file two:
    key = unwhitespace(tolowercase(line))
    list{key} = line
foreach line in file three:
    key = unwhitespace(tolowercase(line))
    if exists(list{key})
        delete list{key}
foreach key in list:
    print list{key}
paxdiablo
A: 

Excel can do it, as above. The programming language most suited though is Perl.

Mark Allen
+1  A: 

In Python, something like this:

Note, this will write lower-case emails to the final output. If that's not ok, then a dictionary-based solution would be necessary.

def read_file(filename):
    with file(filename, "r") as f:
        while True:
            line = f.readline();
            if not line:
                break;
            line = line.rstrip();
            if line:
                yield line;

def write_file(filename, lines):
    with file(filename, "w") as f:
        for line in lines:
            f.write(line + "\n");

set_a = set((line.lower() for line in read_file("file_a.txt")));
set_b = set((line.lower() for line in read_file("file_b.txt")));
set_c = set((line.lower() for line in read_file("file_c.txt")));

# Calculate (a + b) - c
write_file("result.txt", set_a.union(set_b).difference(set_c));
Lasse V. Karlsen
+2  A: 

As Excel was mentioned, you can also do this kind of thing with Jet and VBScript.

Set cn = CreateObject("ADODB.Connection")
strCon = "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=c:\Docs\;" _
& "Extended Properties=""text;HDR=No;FMT=Delimited"";"

cn.Open strCon

strSQL = "SELECT F1 Into New.txt From EmailsA.txt " _
    & "WHERE UCase(F1) Not IN (SELECT UCase(F1) From EmailsC.txt)"
cn.Execute strSQL

strSQL = "INSERT INTO New.txt ( F1 ) SELECT F1 FROM EmailsB.txt " _
    & "WHERE UCase(F1) Not IN (SELECT UCase(F1) From EmailsC.txt)"
cn.Execute strSQL
Remou
+1  A: 

I think the above answers, answer the technical HOW TO question; the only thing left to consider is how many times will you have to perform the task. If it's a one-time thing and you're more comfortable with Excel, start there. If you know you will have perform this task at least twice and maybe more, then coding up a script or executable is the way to go.

simon
+1  A: 

Sadly this answer probably won't help you, but if in fact you were using Unix (Linux for example) you could do something like:

cat filea >> fileb # append file a to file b

sort fileb | uniq > newFile # newFile now contains a merger of file a and file b, with sorted and unique email addresses

The above could all be done on one line as follows: cat filea >> fileb | sort | uniq > newFile

Now you're left with simply removing common emails. Some variation of "diff" should be helpful there such as perhaps: diff newFile fileC > finalFile

diff will give you a list of differences between the two files, so the output in "finalFile" should be a list of emails that are in "newFile" (the merger of A & B) but are NOT in fileC. Options to the various tools allow you to ignore whitespace and case. I'd have to play with it a bit to get it exactly right but the above is the general idea.

I used to have an extra box running Linux for the sole purpose of doing stuff like the above which is a hassle under Windoze but a breeze under Unix type operating systems. When my hardware died I never got around to building another Linux box.

I believe the MKS toolkit for Windoze probably has all of the above utilities.