Suppose I want to merge crsp.msf(keeping price and shares outstanding), with the CRSP mutual fund holdings dataset, which has crsp_portno, permno, nbr_shares, and eff_dt. eff_dt is the effective date of the holding.
data msf(rename=date=eff_dt); set crsp.msf(keep=permno date ret prc shrout); date=intnx('month', date, 1)-1; run; data holdings; set crsp.holdings(keep=crsp_portno eff_dt nbr_shares permno); where permno is not missing; run; proc sort data=msf; by permno eff_dt; proc sort data=holdings; by permno eff_dt; data complete; set msf; do until(last.eff_dt); set holdings; by permno eff_dt; output complete; end; run;
This reads an observation from MSF. It then enters the DO-loop, and reads all observations from HOLDINGS until the LAST.EFF_DT=1, after which it goes to to the top of the DATA step and reads the next observation from MSF.
The obvious question is: how do you know that the permno-eff_dt combination read from HOLDINGS is the same as that read from MSF? In this setup, you don't. To make it work, before the DATA step with the DOW loop, you need to do the following:
proc sort data=holdings(keep=permno eff_dt) out=permnoeffdts_holdings nodupkey; by permno eff_dt; proc sort data=msf(keep=permno eff_dt) out=permnoeffdts_msf nodupkey; by permno eff_dt; data msf; merge msf(in=a) permnoeffdts_holdings(in=b); by permno eff_dt; if a and b; run; data holdings; merge holdings(in=a) permnoeffdts_msf(in=b); by permno eff_dt; if a and b; run;
This restricts the two datasets to contain only those permno-eff_dt combinations which exist in both.
When this is done, you run the DOW-loop as above:
proc sort data=msf; by permno eff_dt; proc sort data=holdings; by permno eff_dt; data complete; set msf; do until(last.eff_dt); set holdings; by permno eff_dt; output complete; end; run;
What happens now? Now you know that for every observation read from MSF, there is at least one observation in HOLDINGS with the same permno-eff_dt combination. Let me now go through the logic again.
The first observation is read from MSF. The permno-eff_dt combination of that observation exists in HOLDINGS, and because HOLDINGS contains only those combinations that are also in MSF, that combination must be the first combination encountered in HOLDINGS. We keep reading in observations from holdings (and outputting to COMPLETE) until LAST.EFF_DT=1, at which point there are no more observations in HOLDINGS with that particular combination of PERMNO and EFF_DT. Then we exit the DO-loop and read the next observation from MSF, and repeat the process.