Next: Writing frequently-used code as Up: Loops Previous: %FOR variable in &LIST

### Subsetting observations from large datasets efficiently

Suppose you want to calculate the average correlation between stocks in each month in your sample. Calculating correlations means you will have a dataset with N-squared elements, where N is the number of stocks in that month. To do this, you write a macro loop (see section 10.2.1):

```     %let startdate= '31Jan1980'd;
%let enddate= '31Jan2000'd;

%macro justaloop;
%let startdate=%sysfunc(putn(&startdate, 8.));
%let enddate=%sysfunc(putn(&enddate, 8.));

%let date=&startdate;
%do %while(&date <= &enddate);

data tempdsf;
set crsp.dsf;
where intnx('month', date, 1)-1 = &date;
run;

<code to calculate correlations>

%let date=%eval(%sysfunc(intnx(month, &date+1, 1))-1);
%end;
%mend;
%justaloop;
```

This is inefficient because it takes long to subset DSF each time. Instead, do the following:

1. Outside the loop, sort DSF by date
2. Outside the loop, create an ``index'' dataset that tells you for each month where the data for that month begins and ends in the sorted DSF
3. Within the loop, use FIRSTOBS= and OBS= to read only the required observations

Sorting DSF and creating an index are simple:

```
proc sort data=crsp.dsf out=dsf;
where date>=&startdate and date<=&enddate;
by date;
run;

data dsfindex(keep=date firstobs lastobs);
set dsf;
by date;
retain firstobs;
if first.date then do;
firstobs=_n_;
end;
if last.date then do;
lastobs=_n_;
output;
end;
run;
```

DSFINDEX now contains, for each date, the observation numbers at which the observations for that date begin and end.

Within the loop, you then say:

```     data _null_;
set dsfindex;
where date=&date;
call symput('firstobs', firstobs);
call symput('lastobs', lastobs);
run;

data tempdsf;
set dsf(firstobs=&firstobs obs=&lastobs);
run;
```

The DATA _NULL_ statement just means that I am not creating a new dataset when I subset DSFINDEX. The only reason I am opening DSFINDEX is to get the values of FIRSTOBS and LASTOBS into the corresponding macro variables.

I then subset DSF to get TEMPDSF, keeping only the observations numbered between &FIRSTOBS and &LASTOBS. Note that the obs= statement tells you the last observation that will be read, not the number of observations to read.

TEMPDSF now contains all the observations for which DATE equals &DATE. The savings here are enormous: this takes virtually no time.

```  %let startdate= '31Jan1980'd;
%let enddate= '31Jan2000'd;

proc sort data=crsp.dsf out=dsf;
where date>=&startdate and date<=&enddate;
by date;

data dsfindex(keep=date firstobs lastobs);
set dsf;
by date;
retain firstobs;
if first.date then do;
firstobs=_n_;
end;
if last.date then do;
lastobs=_n_;
output;
end;
run;
%macro justaloop;
%let startdate=%sysfunc(putn(&startdate, 8.));
%let enddate=%sysfunc(putn(&enddate, 8.));

%let date=&startdate;
%do %while(&date <= &enddate);

data _null_;
set dsfindex;
where date=&date;
call symput('firstobs', firstobs);
call symput('lastobs', lastobs);
run;

data tempdsf;
set dsf(firstobs=&firstobs obs=&lastobs);
run;

<code to calculate correlations>

%let date=%eval(%sysfunc(intnx(month, &date+1, 1))-1);
%end;
%mend;
%justaloop;
```

Next: Writing frequently-used code as Up: Loops Previous: %FOR variable in &LIST
Andre de Souza 2012-11-19