ansaurus

Question

Answer 1

A:

You can do it with SAS. Since you're working with only 1000 row you can do something a little messy but thats easy to understand and works.

The idea i'm currently following is having an array with all the variables (two arrays if there are numeric and character variables) Then cycle through the arrays and have put the values of each single variables in a new container variable with an enormous length that will be an aggregate of all the values, in your example. CONT="103" for the first ID CONT="213" for the second ID ....

Since i'm not sure if you're working with numbers only i'd say having the new variable as an alfanumeric (thus converting all numeric when you append) is your best way

Once you have done this u'll have a new variable which is the aggregate of everything and you can easily use it to check if some obervation should be in the same category.

Code will follow shortly for your example =)

Here it is, i added some alfanumeric variable for flavor (sorry for formatting, trying to figure out how to format the code thing properly)

*ok i give up,after wasting more than 30 minutes i really cant put this code in a readable way which is a shame.code or blockquote doesnt work as i wanted.

data test;
ID=1; A=1; B=0; C=3; D='HI'; E=54; F='C';
output;
ID=2; A=2; B=7; C=3; D='KI'; E=3; F='C';
output;
ID=3; A=1; B=0; C=3; D='HI'; E=54; F='C';
output;
ID=4; A=2; B=7; C=3; D='KI'; E=3; F='C';
output;
ID=5; A=1; B=7; C=3; D='ZI'; E=3; F='C';
output;
ID=6; A=1; B=8; C=3; D='ZI'; E=3; F='D';
output;
ID=7; A=1; B=8; C=3; D='SI'; E=3; F='C';
output;
ID=8; A=1; B=8; C=3; D='SI'; E=3; F='C';
output;
run;

data test1;
set test;
array numeric(*) _NUMERIC_;
array chara(*) _CHARACTER_;
length cont $2000;
cont='';
do i=1 to dim(numeric);
IF vname(numeric{i})="ID" then continue;
cont=compress(cont)||compress(putn(numeric{i},"BEST."));
end;
do i=1 to dim(chara);
IF vname(chara{i})="ID" then continue;
cont=compress(cont)||compress(chara{i});
end;
drop i;
run;

proc sort data=test1;
by cont;
run;

data test2;
set test1;
by cont;
retain catid 0;
if first.cont then catid+1;
drop cont;
run;

proc sort data=test2 out=test3;
by id;
run;

Fabio Prevedelli 2010-05-27 10:05:08

the sum statement, catid + 1, implies that catid is retained.

Chang Chung 2010-06-17 18:55:58

Answer 2

A:

Since there is no easy way to do this (as I know of), I wrote a small VBScript to handle preprocessing of the file. Here is my testfile and the script I wrote (place them on your desktop, and run proc.vbs to generate output.txt):

[input.txt]
id  a   b   c
1   1   0   3
2   2   1   3
3   1   0   3
4   2   1   3
5   4   5   6
6   4   5   6

[proc.vbs]
' init
set fso = createobject("scripting.filesystemobject")
set input = fso.opentextfile("input.txt")
set output = fso.createtextfile("output.txt")
set dict = createobject("scripting.dictionary")

' read columns
columns = split(input.readline,vbtab)

' write columns, and add categoryid
output.writeline join(columns,vbtab) & vbtab & "categoryid"

' read rows
do while not input.atendofstream
    fields1 = split(input.readline,vbtab)
    fields2 = fields1

    ' in fields1. clear columns that we don't need for category
    for x = 0 to ubound(fields1)
        if lcase(columns(x)) = "id" then
            fields1(x) = ""
        end if
    next

    ' from fields1. create unique category string & add to dict if not exists
    unique = join(fields1,"|")
    if (not dict.exists(unique)) then
        dict.add unique, dict.count + 1
    end if

    ' write fields
    output.writeline join(fields2,vbtab) & vbtab & cstr(dict(unique))
loop

output.close
input.close

msgbox "finished!"

Fredrik Johansson 2010-05-27 10:15:41

thanks! I was thinking about something like this last night in python, but unfortunately, somebody else here has to be able to understand the whole process, so I can't use a script. booooo! oh well.

oob 2010-05-27 14:32:25

Answer 3

+3 A:

Well, I can think of an easy way, but I don't know whether you are going to hit SAS's memory/performance limits... I have never tried proc sort with 2000 variables, but maybe someone else has and can comment.

proc sort data= mydata;
    by A B C D /* etc.... */ myLastColumn;
run;

data mydata;
    set mydata;
    by A B C D /* etc....*/ myLastColumn;
    retain categoryID 0;
    if first.myLastColumn then categoryID +1;
run;

Louisa Grey 2010-05-27 14:00:20

Hi, i'm getting ready to try out your answer, but I've hit a snag in importing the file into SAS.proc import out=inputCSV datafile=file1 dbms=CSV REPLACE; getnames=YES; datarow=2; guessingrows=2000;RUN;It seems SAS has a row-length limit for proc import because it keeps cutting the rows off at a certain point. Is there a linesize or lrecl option for proc import? I know there is one for infile, but I can't find anything about this in the proc import documentation.

oob 2010-05-27 15:57:44

There's a system option-- use "options lrecl= <your value here>;"

Louisa Grey 2010-05-27 16:13:14

gotcha. looks like the limit is 32767. all of my lines are longer. i'll have to figure out a way to split it up.

oob 2010-05-27 16:41:59

Answer 4

A:

I don't have SAS in front of me so I'll have to give you some untested code but you can do this quite easily like so:

1) Concatenate all variables except ID into a single delimited text string. Ensure that the delimiter does not appear in the data. Perhaps use a character like byte(10) for the delimiter. Use catx() to do the concatenation so each var is trimmed (leading and trailing) to save on space. As you have a tonne of vars you may need to split this into 2, 3, or 4 strings... no big deal - it's just easier to work with 4 strings than it is to work with 2000 vars.

2) Let's assume you managed to fit it all into 1 big string. Add the value to an array with 1000 dimensions because there could potentially be 1000 unique identifiers for your table, ie. if every obs had a unique string. Each array element will store the value of the string. The position of the element in the array will be used as the unique id. Before adding each string to the array, loop through the entire array to ensure that it is not already in there. If you do find it in there already, then use the position as the unique_id for the current row.

Sooo that was kind of wordy but I'll have a crack at doing that in some untested code...

proc sql noprint;
  select varname from sashelp.vcolumns into :vars separated by ',' where varname ne 'id' and memname eq 'xx';
quit;


data yy;

  length str1-str1000 unique_string $32767;

  set xx;

  array arr_unique {1000} str1-str1000;
  retain unique_counter 1;

  unique_string = catx(byte(10),&vars);

  exists_already = 0;
  do i = 1 to (unique_counter -1);
    if arr_unique[i] eq unique_string then do;
      unique_id = i;
      exists_already = 1;
      leave;
    end;
  end;

  if not exists_already then do;
    arr_unique[unique_counter] = unique_string;
    unique_counter = unique_counter + 1;
  end;

  drop str1-str1000;

run;

Good luck!

Cheers Rob

Rob Penridge 2010-05-28 04:03:45

Answer 5

+1 A:

Try this

select *, dense_RANK() over(order by a,b,c) from table
-- order by id

msi77 2010-05-28 07:19:12

I am not sure dense_RANK() works in SAS or MS Access, but I did not know about this SQL function. Thank you! I will definitely use this in the future.

oob 2010-06-01 18:23:47

Access does not support rank/dense_rank.

msi77 2010-06-01 20:42:44

Answer 6

A:

This is easy with proc sort. Just list all the variables you want to use in grouping after by, and then increment the group id when the last variable changes value in the next data step. Below worked with 9.2 (TS1M0) on Windows. HTH.

EDIT: did not know Louisa Grey already suggested this. I guess my answer just shows that this indeed works... :-)

/* test data */
%let seed = 1234567;
data one;
  length id 8 v2 $1;
  array v[3:2000] v3-v2000;
  keep id v:;
  do id = 1 to 1e3;
    v2 = substr("ABCDEF", ceil(6*ranuni(&seed)), 1);
    do j = 3 to 9, 11 to 1999;
      v[j] = j;
    end;
    v[10] = ceil(2*ranuni(&seed));
    v[2000] = ceil(3*ranuni(&seed));
    output;
  end;
run;
/* on log
NOTE: The data set WORK.ONE has 1000 observations and 2000 variables.
*/

/* group observations based on all the variables
   values except id */
proc sort data=one;
  by v:;
run;
data two;
  set one;
  by v:;
  if first.v2000 then group + 1;
run;

/* check */
proc freq data=two;
  tables group;
run; 
proc print data=two;
  var id group v2 v10 v2000;
run;

Chang Chung 2010-06-17 18:40:06

ansaurus

tags:

views:

answers:

Categorize data without consolidating?

related questions