I have a table with about 1000 records and 2000 columns. What I want to do is categorize each row such that all records with equal column values for all columns except 'ID' are given a category ID. My final answer would look like:
ID A B C ..... Category ID 1 1 0 3 1 2 2 1 3 2 3 1 0 3 1 4 2 1 3 2 5 4 5 6 3 6 4 5 6 3
where all columns (besides ID) are equal for IDs 1,3 so they get the same category ID and so on.
I guess my thought was to just write a SQL query that does a group by on every single column besides 'ID' and assign a number to each group and then join back to my original table. My current input is a text file, and I have SAS, MS Access, and Excel to work with. (I could use proc sql from within SAS).
Before I go this route and construct the whole query, I was just wondering if there was a better way to do this? It will take some work just to write the query, and I'm not even sure if it is practical to join on 2000 columns (never tried), so I thought I'd ask for ideas before I got too far down the wrong path.
EDIT: I just realized my title doesn't really make sense. What I was originally thinking was "Is there a way I can group by and categorize at the same time without actually consolidating into groups?"
EDIT2: After importing the table into Excel, I was easily able to determine that only about 200 of the 2000 columns actually varied, so problem with too many columns went away. To categorize, I only imported the columns that varied, and I did something like the following:
proc sql;
create table categories as
select distinct *
from inputTable;
quit;
data categories;
set categories;
categoryID = _N_;
run;
proc sql;
create table tableCategorized as
select a.ID, b.CategoryID
from inputTable as a, categories as b
where
(
a.A=b.A and
a.B=b.B and
a.C=b.C and
...
a.XYZ=b.XYZ);
;
quit;
It was a pain to generate all the "=" comparisons, but I just did it using string manipulation techniques in Excel, so it wasn't too bad at all. Thanks for all of the suggestions.