data msf; set crsp.msf(keep=permno date ret); date=intnx('month', date, 1)-1; lagret=lag(ret); run;If you're puzzled by the
date=intnx('month', date, 1)-1;
statement, please see section 6.1 for an explanation.
One problem with this is that the first value of lagret for each permno is set to the last value of ret for the previous permno. To avoid this, I write
data msf; set crsp.msf(keep=permno date ret); by permno; date=intnx('month', date, 1)-1; lagret=lag(ret); if first.permno then lagret=.; run;Observe that I did not write:
data msf; set crsp.msf(keep=permno date ret); by permno; date=intnx('month', date, 1)-1; if not(first.permno) then lagret=lag(ret); run;which, at first glance, seems equivalent. The problem is that the lag() function pushes the current variable value onto a stack every time it is called, and returns that value the next time it is called. This means that if the lag() function is not called on every observation, it will give counterintuitive results. In the code just shown, lag() is not being called for the first observation of each permno. Using the lag() function poses another problem. Suppose the value of return is missing in a particular month, as for July 1986 for permno 10000 below:
PERMNO DATE RET 10000 19860228 -0.257143 10000 19860331 0.365385 10000 19860430 -0.098592 10000 19860531 -0.222656 10000 19860630 -0.005025 10000 19860831 -0.615385 10000 19860930 -0.057143 10000 19861031 -0.242424 10000 19861130 0.060000 10000 19861231 -0.377358 10000 19870130 -0.212121 10000 19870228 0.000000 10000 19870331 -0.384615 10000 19870430 -0.062500 10000 19870531 -0.066667 10001 19860228 0.020408 10001 19860331 0.025200 10001 19860430 0.009901 10001 19860531 -0.009804
This would obviously create trouble for the approach above. One way to handle this would be to say:
data msf; set crsp.msf(keep=permno date ret); by permno; date=intnx('month', date, 1)-1; lagret=lag(ret); lagdate=lag(date); if first.permno then lagret=.; if not(lagdate=intnx('month', date, 0)-1) then lagret=.; run;
which checks to see if the lagged date is indeed a month previous to the current date. This is inelegant, and difficult to extend, if, for instance, you have several variables to lag.
Instead, do the following:
data msf; set crsp.msf(keep=permno date ret); date=intnx('month', date, 1)-1; run; data msflag(rename=(ret=lagret)); set msf; date=intnx('month', date+1, 1)-1; run; data msf; merge msf msflag; by permno date; run;
This creates a new dataset msflag with all variables identical, except that the date is moved forward one month. This means that, in msflag, the return for January 1986 for each permno is going to have the date set to February of 1986. When you merge msflag with msf (by permno and date), the returns line up exactly as you want them to. Explicitly, msflag looks like this:
PERMNO DATE LAGRET 10000 19860331 -0.257143 10000 19860430 0.365385 10000 19860531 -0.098592 10000 19860630 -0.222656 10000 19860731 -0.005025 10000 19860930 -0.615385 10000 19861031 -0.057143 10000 19861130 -0.242424 10000 19861231 0.060000 10000 19870131 -0.377358 10000 19870228 -0.212121 10000 19870331 0.000000 10000 19870430 -0.384615 10000 19870531 -0.062500 10000 19860228 -0.066667 10001 19860331 0.020408 10001 19860430 0.025200 10001 19860531 0.009901 10001 19860630 -0.009804
This handles all the exceptions you can think of. The merged dataset is below:
Obs PERMNO DATE RET lagret 1 10000 19860228 -0.257143 . 2 10000 19860331 0.365385 -0.257143 3 10000 19860430 -0.098592 0.365385 4 10000 19860531 -0.222656 -0.098592 5 10000 19860630 -0.005025 -0.222656 6 10000 19860731 . -0.005025 7 10000 19860831 -0.615385 . 8 10000 19860930 -0.057143 -0.615385 9 10000 19861031 -0.242424 -0.057143 10 10000 19861130 0.060000 -0.242424 11 10000 19861231 -0.377358 0.060000 12 10000 19870131 -0.212121 -0.377358 13 10000 19870228 0.000000 -0.212121 14 10000 19870331 -0.384615 0.000000 15 10000 19870430 -0.062500 -0.384615 16 10000 19870531 -0.066667 -0.062500 17 10000 19870630 . -0.066667 18 10001 19860228 0.020408 . 19 10001 19860331 0.025200 0.020408 20 10001 19860430 0.009901 0.025200
This will not work with daily data. First, Friday is not one day before Monday, but that can be handled by using weekday as the incrementing unit in intnx. Second, holidays make any systematization impossible.
For daily data, you need another method. Suppose I want to find lag returns in the DSF file, that is, the return on the previous trading day.
The way I do it is to first create a dataset that contains, for every date in DSF, the date of the previous trading day. Then I merge this dataset into DSF by DATE, creating a dataset MAIN, which therefore has a LAGDATE associated with each DATE. Then I re-merge DSF with MAIN, but by PERMNO LAGDATE, so that the values of RET being merged into MAIN are the lagged values.
First create a file with all the dates in the dataset, and create a lagdate:
proc sort data=crsp.dsf(keep=date) out=dsfdates nodupkey; by date; run; data dsfdates; set dsfdates; lagdate=lag(date); run;
Then merge that into DSF, by date:
proc sort data=crsp.dsf(keep=permno date ret) out=dsf; by date; run; data main; merge dsf dsfdates; by date; run;
MAIN now has PERMNO DATE RET and LAGDATE. I need to get LAGRET. To do this, I merge DSF into MAIN, but by PERMNO LAGDATE, not PERMNO DATE:
data lagdsf(rename=(date=lagdate ret=lagret)); set crsp.dsf(keep=permno date ret); run; /*NOTE: DSF is already sorted by permno date*/ data main; merge main(in=a) lagdsf; by permno lagdate; run;
This is horribly inefficient, because you have to sort DSF by date, which takes 3 minutes at the best of times, but it serves to illustrate the point.