next up previous
Next: Writing frequently-used code as Up: Loops Previous: %FOR variable in &LIST

Subsetting observations from large datasets efficiently

Suppose you want to calculate the average correlation between stocks in each month in your sample. Calculating correlations means you will have a dataset with N-squared elements, where N is the number of stocks in that month. To do this, you write a macro loop (see section 10.2.1):

     %let startdate= '31Jan1980'd;
     %let enddate= '31Jan2000'd;
     
     %macro justaloop;
     %let startdate=%sysfunc(putn(&startdate, 8.));
     %let enddate=%sysfunc(putn(&enddate, 8.));

     %let date=&startdate;
      %do %while(&date <= &enddate);

            data tempdsf;
               set crsp.dsf;
	       where intnx('month', date, 1)-1 = &date;
	    run;
	    
	    <code to calculate correlations>

            %let date=%eval(%sysfunc(intnx(month, &date+1, 1))-1);
        %end;
     %mend;
     %justaloop;
   

This is inefficient because it takes long to subset DSF each time. Instead, do the following:

  1. Outside the loop, sort DSF by date
  2. Outside the loop, create an ``index'' dataset that tells you for each month where the data for that month begins and ends in the sorted DSF
  3. Within the loop, use FIRSTOBS= and OBS= to read only the required observations

Sorting DSF and creating an index are simple:

  
  proc sort data=crsp.dsf out=dsf;
    where date>=&startdate and date<=&enddate;
    by date;
  run;
    
  data dsfindex(keep=date firstobs lastobs);
     set dsf;
     by date;
     retain firstobs;
     if first.date then do;
           firstobs=_n_;
        end;
     if last.date then do;
           lastobs=_n_;
           output;
        end;
    run;

DSFINDEX now contains, for each date, the observation numbers at which the observations for that date begin and end.

Within the loop, you then say:

     data _null_;
        set dsfindex;
	where date=&date;
	call symput('firstobs', firstobs);
        call symput('lastobs', lastobs);
     run;

     data tempdsf;
        set dsf(firstobs=&firstobs obs=&lastobs);
     run;

The DATA _NULL_ statement just means that I am not creating a new dataset when I subset DSFINDEX. The only reason I am opening DSFINDEX is to get the values of FIRSTOBS and LASTOBS into the corresponding macro variables.

I then subset DSF to get TEMPDSF, keeping only the observations numbered between &FIRSTOBS and &LASTOBS. Note that the obs= statement tells you the last observation that will be read, not the number of observations to read.

TEMPDSF now contains all the observations for which DATE equals &DATE. The savings here are enormous: this takes virtually no time.

The complete code thus reads:

  %let startdate= '31Jan1980'd;
  %let enddate= '31Jan2000'd;

  proc sort data=crsp.dsf out=dsf;
      where date>=&startdate and date<=&enddate;
      by date;

  data dsfindex(keep=date firstobs lastobs);
     set dsf;
     by date;
     retain firstobs;
     if first.date then do;
           firstobs=_n_;
        end;
     if last.date then do;
           lastobs=_n_;
           output;
        end;
  run;
  %macro justaloop;
    %let startdate=%sysfunc(putn(&startdate, 8.));
    %let enddate=%sysfunc(putn(&enddate, 8.));
    
    %let date=&startdate;
      %do %while(&date <= &enddate);
    

            data _null_;
               set dsfindex;
	       where date=&date;
	       call symput('firstobs', firstobs);
               call symput('lastobs', lastobs);
	    run;

            data tempdsf;
               set dsf(firstobs=&firstobs obs=&lastobs);
             run;
	    
	    <code to calculate correlations>

           %let date=%eval(%sysfunc(intnx(month, &date+1, 1))-1);
      %end;
   %mend;
   %justaloop;
   


next up previous
Next: Writing frequently-used code as Up: Loops Previous: %FOR variable in &LIST
Andre de Souza 2012-11-19