[an error occurred while processing this directive] Subsetting observations from large datasets efficiently next up previous
Next: Loops Up: Macro variables Previous: Macro variables

Subsetting observations from large datasets efficiently

In the section on efficiency (Section 14), I advocate using where statements to subset datasets before reading them in. For instance, say

     data dsf;
        set crsp.dsf(keep=permno date ret);
	where permno in(10001, 10002, 10003);
     run;

instead of

     data dsf;
        set crsp.dsf(keep=permno date ret);
	if permno=10001 or permno=10002 or permno= 10003;
     run;

What happens when you do not know in advance what permnos you want to keep? Or if you do know, but the list is a headache to type out?

In that case, you can replace the contents of the where with a macro variable. Suppose some previous steps produce a dataset, named KEEPTHESE that contains the permnos you want to keep:

   Obs      PERMNO

    1       10000
    2       10006
    3       10009
    4       10021

Given this dataset, I want to subset CRSP.DSF, keeping only the permnos listed in KEEPTHESE. To do this, I first run the following data step, which creates several macro variables.

     data _null_;
        set keepthese nobs=nobs;
	 if _n_ =1 then call symput("nobs", nobs);
         call symput("permname"||trim(left(put(_N_,4.))), permno);
     run;
The first line after the SET statement creates a macro variable named NOBS, which contains the number of observations in KEEPTHESE. The second line creates a series of macro variables named permname1, permname2, permname3, ... containing the permnos in the order in which they appeared in KEEPTHESE. Now inside a macro, I concatenate all these macro variables.
     %macro justconcatenate;
        %let listofpermnos=&permname1;
        %do i =2 %to &nobs;
           %let listofpermnos =%sysfunc(catx(%str(,),&listofpermnos, "%trim(&&permname&i)"));
        %end;
     %let listofpermnos=(&listofpermnos);
     %mend;
     %justconcatenate;
   
This macro just creates a macro variable named LISTOFPERMNOS whose value is set to the contents of the macro variable PERMNAME1. I then join all the remaining PERMNAMEs to this LISTOFPERMNOS with commas as delimiters.

Finally, I can use the macro variable so created:

     data dsf;
        set crsp.dsf(keep=permno date ret);
	where permno in &listofpermnos;
     run;

This might seem like a lot of bother for nothing, but supposing you are work with a large dataset, like TAQ, and you need to keep observations with symbols in a dataset like KEEPTHESE. Sorting and merging is not an option. This method works well, as would hashes(see section 12). I have not compared this method against a hash.


next up previous
Next: Loops Up: Macro variables Previous: Macro variables
[an error occurred while processing this directive]Andre de Souza 2012-11-19