views:

94

answers:

2

Hi, I have an excel spreadsheet with 15 columns, one of which is EmailAddress and then 100,000+ records..

In my data i know that there are many duplicate email addresses.

Can someone tell me how can i extract a distinct list where each record is represented only once by emailaddress?

Alternately, if i import the data into SQL, how can i remove records that are duplicated by emailaddress but leave one record if 4 are found...

Thx..

+1  A: 

In simpler tasks I would suggest the use of openrowset. Of course you could combine that with other queries following that, to do the filtering, but in this case it would be great to use DTS for MSSQL2000 and SSIS for later versions.

How?
the question is complex, and has 2 sub-questions that must have been answered previously.

  1. How to remove duplicates
  2. Excel and SQL

and you could read more about SSIS http://stackoverflow.com/search?q=SSIS+Excel

Alexander
A: 

See the link on How to remove duplicates in Alexander's answer, for dealing with it in SQL. (Note that the linked answer is specific to SQLServer - the syntax is likely to be slightly different if you are using other versions of SQL.)

For Excel, I suggest either:

  1. Using a pivot table, or
  2. Sort on EmailAddress, insert an additional column populated with formulas similar to:

    =IF(A1=A2,"","X")
    

    [- row 2, assuming that EmailAddress is in column A; copy and paste for the rest of the column] and use autofilter to select calculated values of X.

Mark Bannister