views: 267
answers: 2

After working with different data sets in SAS for a month or two, it seems to me that the more variables a data set has, the longer it takes to run PROCs and other operations on it. Yet if I have, for example, 5 variables but 1 million observations, performance is not impacted much.

While I'm interested in whether observations or variables affect performance more, I was also wondering if there are other factors I'm missing when looking at SAS performance?

Thanks!

+3  A: 

For the same total size (rows*columns), I believe the data set with more variables will usually be slower. I tried creating two data sets: one with 1 row and 10,000 columns, the other with 10,000 rows and 1 column. The one with more variables took far more memory and time.

options fullstimer;
data a;                      /* 1 observation, 10000 variables */
    retain var1-var10000 1;
run;
data b(drop=i);              /* 10000 observations, 1 variable */
    do i=1 to 10000;
        var1=i;
        output;
    end;
run;

In the log:

31   options fullstimer;
32   data a;
33       retain var1-var10000 1;
34   run;

NOTE: The data set WORK.A has 1 observations and 10000 variables.
NOTE: DATA statement used (Total process time):
      real time           0.23 seconds
      user cpu time       0.20 seconds
      system cpu time     0.03 seconds
      Memory                            5382k
      OS Memory                         14208k
      Timestamp            10/14/2009  2:03:57 PM


35   data b(drop=i);
36       do i=1 to 10000;
37       var1=i;
38       output;
39       end;
40   run;

NOTE: The data set WORK.B has 10000 observations and 1 variables.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      user cpu time       0.00 seconds
      system cpu time     0.01 seconds
      Memory                            173k
      OS Memory                         12144k
      Timestamp            10/14/2009  2:03:57 PM

You should also check out the BUFNO= and BUFSIZE= options. If you have to access a data set many times, you might also consider using SASFILE to hold the entire data set in memory.
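A minimal sketch of how those might fit together (WORK.BIG and the option values are placeholders; BUFNO=/BUFSIZE= need tuning for your data and hardware, and SASFILE only helps if the whole data set fits in available memory):

options bufno=20 bufsize=64k;   /* more and larger buffers can cut the number of I/O operations */

sasfile work.big load;          /* placeholder data set; hold it entirely in memory */
proc means data=work.big;       /* repeated PROCs now read from memory instead of disk */
run;
proc freq data=work.big;
run;
sasfile work.big close;         /* release the memory when you are done */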

cmjohns
SASFILE sounds useful, but my data set is too large for the machine I'm on (~1.8gb, and I get constant not-enough-memory errors when I try to SASFILE it). Thanks for doing those quick tests!
chucknelson
+2  A: 

I can't say for certain (I'm making an educated guess), but I imagine it has to do with a combination of factors, including that the whole record is read into the PDV (program data vector), which means more data sits in memory when there are many variables.
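One practical consequence, as a rough sketch (WORK.BIG and the variable names are placeholders): if a step only needs a few of the variables, a KEEP= data set option limits what is loaded into the PDV and keeps the output narrow:

data work.narrow;
    set work.big(keep=id var1 var2);   /* placeholder names; only these variables enter the PDV */
run;

proc report data=work.narrow nowd;     /* later steps then process far fewer columns */
run;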

It might be worth doing some measurements with compressed data sets, because I/O is often the bottleneck.

SAS data set option:

data foo(compress=yes); ... run;
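A minimal sketch of that kind of measurement (WORK.BIG is a placeholder for your own data set): make an uncompressed and a compressed copy, then compare the FULLSTIMER notes and the "decreased size by ..." NOTE that SAS writes to the log when compression pays off:

options fullstimer;

data work.big_plain;                /* uncompressed copy */
    set work.big;                   /* work.big is a placeholder */
run;

data work.big_comp(compress=yes);   /* compressed copy; the log reports the size reduction */
    set work.big;
run;

proc means data=work.big_plain;     /* run the same PROC against each copy and compare the times */
run;
proc means data=work.big_comp;
run;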

Rog
Thanks, I think I'll try out compression today. The hard drive on the machine I'm running on is constantly thrashing when running through the hundreds of PROC REPORT statements I'm currently using, so I think it is definitely I/O, and COMPRESS might help!
chucknelson
Compression is helping quite a bit with my large data set. Thanks!
chucknelson
In addition to setting it on individual data sets, you can also set compression as a global option or as a library option for different levels of granularity.
cmjohns
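A quick sketch of those other two levels (the library path is just a placeholder):

options compress=yes;                             /* global: every data set created afterwards is compressed */

libname proj 'C:\myproject\data' compress=yes;    /* library: data sets written to PROJ are compressed */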