ansaurus

Question

Is it possible to do a case-insensitive DISTINCT with SAS (PROC SQL)?

Answer 1

+1 A:

Normalizing case does seem advisable -- if 'Australia', 'australia' and 'AUSTRALIA' all occur, which one of the three would you want as the "case-sensitively unique" answer to your query, after all? If you're keen on some specific heuristics (e.g. count how many times they occur and pick the most popular), this can surely be done but might be a huge amount of extra work -- so, how much is such persnicketiness worth to you?

Alex Martelli 2009-05-29 05:39:20

The intention is use a more complicated query to show addresses for a user to choose from, and it doesn't matter a great deal which one is displayed, as long as there aren't loads of case-wise duplicates.

Rog 2009-05-29 06:37:04

Answer 2

+4 A:

If you have some primary int key (let's call it ID), you could use:

SELECT country FROM companies
WHERE id =
(
    SELECT Min(id) FROM companies
    GROUP BY Upper(country)
)

Roee Adler 2009-05-29 05:40:31

Thanks. I suspect this is the best available approach. I also imagine it will be difficult to scale across mutiple distinct columns and thousands of rows, but will give it a shot.My real life query is somewhat more elaborate than the example used for my question.

Rog 2009-05-29 06:42:49

There must be some heuristic in the solution, and I believe you need to decide on it explicitly yourself (e.g. choose the one with lowest ID). If your SQL issue is "larger" - post a more elaborate question and let the SO crowd give it a try... :)

Roee Adler 2009-05-29 07:25:18

Can you explain in your question why the upper() solution is less than optimal? In this solution, the case (Proper, upper, lower, crazy, etc.) is completely arbitrary and based on the order in the data. Why would the first instance of case be more relevant as a return result than the upper(country) instance of case?

Jay Stevens 2009-06-03 16:40:07

Answer 3

+1 A:

A non-SQL method (really only a single step as the data step just creates a view) would be:


data companies_v /view=companies_v;
  set companies (keep=country);
  _upcase_country = upcase(country);
run;

proc sort data=companies_v out=companies_distinct_countries (drop=_upcase_country) nodupkey noequals;
  by _upcase_country;
run;

Simon Nickerson 2009-05-29 07:17:27

Answer 4

A:

Hi Rog, I think Regular expressions can help you out with the pattern you want to have in your search string.

For the regular expression you can define a UDF which can be prepared seeing the tutorial. www.sqlteam.com/article/regular-expressions-in-t-sql

Thanks, Ashish Jain

Beginner 2009-05-29 08:11:30

Answer 5

A:

Maybe I'm missing something, but why not just:

data testZ;
    input Name $;
    cards4;
Bob
Zach
Tim
Eric
Frank
ZacH
BoB
eric
;;;;
run;

proc sql;
    create view distinctNames as
    select distinct Upper(Name) from testz;
quit;

This creates a view with only distinct names as row values.

Zach 2009-05-29 18:22:54

This unnecessarily changes values in a manner that doesn't suit the purpose of this query. In other words, if a non-duplicate lower (or proper) case version exists, that is what should appear in the results.

Rog 2009-05-31 16:52:06

Answer 6

A:

I was thinking along the same lines as Zach, but thought I would look at the problem with a more elaborate example,

proc sql;
    CREATE TABLE contacts (
     line1 CHAR(30), line2 CHAR(30), pcode CHAR(4)
    );
    * Different versions of the same address - L23 Bass Plaza 2199;
    INSERT INTO contacts values('LEVEL 23 bass', 'plaza'  '2199');
    INSERT INTO contacts values('level 23 bass ', ' PLAZA'  '2199');

    INSERT INTO contacts values('Level 23', 'bass plaza'  '2199');
    INSERT INTO contacts values('level 23', 'BASS plaza'  '2199');

    *full address in line 1;
    INSERT INTO contacts values('Level 23 bass plaza', ''  '2199');
    INSERT INTO contacts values(' Level 23 BASS plaza  ', ''  '2199');

;quit;

Now we can output
i. One from each category? Ie three addresses ?
OR
ii. Or just one address ? if so which version should we prefer ?

Implementing case 1 can be as simple as :

proc sql;
    SELECT DISTINCT UPCASE(trim(line1)), UPCASE(trim(line2)), pcode 
    FROM contacts 
;quit;

Implementing case 2 can be as simple as:

proc sql;
    SELECT DISTINCT UPCASE( trim(line1) || ' ' || trim(line2) ) , pcode 
    FROM contacts 
;quit;

Raz 2009-05-29 23:23:36

In the second case, we could use a *special character* to delimit the columns so that we can separate them later. ie. SELECT DISTINCT UPCASE( trim(line1) || '$$' || trim(line2) ) AS address, pcode FROM contacts ;So if we can tokenize the address back to line1 and line2 if necessary.. Just a though I'm sure there are better ways of doing it.. :)

Raz 2009-05-29 23:28:30

We would want case 2 (1 address), but ideally the result should be one of the existing addresses, not a new uppercase version.

Rog 2009-05-31 16:56:49

ansaurus

tags:

views:

answers:

Is it possible to do a case-insensitive DISTINCT with SAS (PROC SQL)?

related questions