I have tried the following options with unacceptable response times - creating an index on 'key' did not help either (NOTE: there are duplicate 'keys' in both datasets):

data a;
  merge b(in=b) c;
  by key;
  if b;
run;

=== OR ===

proc sql;
  create table a as
  select *
  from b left outer join c
  on b.key = c.key;
quit;

+1  A: 

You should sort the two datasets before merging them; that is what will give you the performance. Using an index when you have to scan the whole table to produce a result is usually slower than presorting the datasets and merging them.
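
A minimal sketch of the sort-then-merge pattern, using the dataset and key names from the question:

proc sort data=b; by key; run;
proc sort data=c; by key; run;

data a;
  merge b(in=inb) c;   /* in= flags which rows came from b */
  by key;
  if inb;              /* keep rows present in b, like a left join */
run;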

+1  A: 

Be sure to trim your datasets as much as possible. Sort your datasets before the data step or PROC SQL. Also, I'm not 100% sure it matters, but ANSI SQL would be: proc sql; create table a as select * from b left outer join c on b.key = c.key; quit;

AFHood
A: 

Creating a key is the fastest way to get the datasets ready for joining. Sorting them can take just as long as merging them, if not longer, but is still a good idea.

AFHood's suggestion of trimming them is a good one. Can you possibly run them through a PROC SUMMARY? Are there any columns you can drop? Any of these will reduce the size of your dataset and make the merging faster.
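
A rough sketch of both ideas; only the dataset and key names come from the question, the other variable names (var1, amount) are placeholders:

/* Keep only the columns you actually need for the join. */
data c_small;
  set c(keep=key var1 amount);
run;

/* PROC SUMMARY can also collapse duplicate keys and pre-aggregate. */
proc summary data=c_small nway;
  class key;
  var amount;
  output out=c_sum(drop=_type_ _freq_) sum=;
run;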

None of these methods may work, however. I routinely merge files of several million rows and it can take a while.

JT
A: 

You might try the SQL merge. I don't know if it would be faster for your needs but I've found SQL to be much more efficient than a regular SAS merge. Plus, once you realize what you can do with SQL, manipulating datasets becomes easier!

kgb
A: 

Don't use a data step merge to do this.
With duplicate keys in both datasets, the result will be wrong.
The only way to do this is with a PROC SQL join:

proc sql;
  create table newdata as
  select firsttable.*, secondtable.*
  from table1 as firsttable
  inner join table2 as secondtable
  on (firsttable.keyfield = secondtable.keyfield);
quit;

If you have more than one keyfield, the join order should go from the field with the fewest matches first to the field with the most matches last. SAS has a bad habit of creating a temporary dataset containing all possible matches and then sieving it down from there, which can blow out your temporary space allocation and slow everything down.
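
For example, with two key fields the ON clause could list the most selective field (fewest matches per value) first; the field names below are hypothetical:

proc sql;
  create table newdata as
  select firsttable.*, secondtable.*
  from table1 as firsttable
  inner join table2 as secondtable
  on (firsttable.account_id = secondtable.account_id   /* fewest matches first */
  and firsttable.region = secondtable.region);          /* most matches last */
quit;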

If you still wish to use a DATA step merge, then you need to remove the duplicate keys from one of the datasets.
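
One common way to do that is PROC SORT with the NODUPKEY option, again using the names from the question:

/* Keep only one row per key value in c so a data step merge is safe. */
proc sort data=c out=c_nodup nodupkey;
  by key;
run;

After c_nodup is created, the sorted data step merge shown in the first answer can be used with c_nodup in place of c.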