Next: Rolling regressions without macros Up: Working with time-series data Previous: Moving dates around

Avoiding the lag() function

A common task in time-series data is to find the lag of a variable: that is, the value the variable took in the previous period. Suppose I wanted to find the value of return in the previous month for every permno at every month-end. I could do this by saying:
```     data msf;
set crsp.msf(keep=permno date ret);
date=intnx('month', date, 1)-1;
lagret=lag(ret);
run;
```
If you're puzzled by the

```        date=intnx('month', date, 1)-1;
```

statement, please see section 6.1 for an explanation.

One problem with this is that the first value of lagret for each permno is set to the last value of ret for the previous permno. To avoid this, I write

```     data msf;
set crsp.msf(keep=permno date ret);
by permno;
date=intnx('month', date, 1)-1;
lagret=lag(ret);
if first.permno then lagret=.;
run;
```
Observe that I did not write:
```     data msf;
set crsp.msf(keep=permno date ret);
by permno;
date=intnx('month', date, 1)-1;
if not(first.permno) then lagret=lag(ret);
run;
```
which, at first glance, seems equivalent. The problem is that the lag() function pushes the current variable value onto a stack every time it is called, and returns that value the next time it is called. This means that if the lag() function is not called on every observation, it will give counterintuitive results. In the code just shown, lag() is not being called for the first observation of each permno. Using the lag() function poses another problem. Suppose the value of return is missing in a particular month, as for July 1986 for permno 10000 below:

``` PERMNO        DATE            RET

10000    19860228      -0.257143
10000    19860331       0.365385
10000    19860430      -0.098592
10000    19860531      -0.222656
10000    19860630      -0.005025
10000    19860831      -0.615385
10000    19860930      -0.057143
10000    19861031      -0.242424
10000    19861130       0.060000
10000    19861231      -0.377358
10000    19870130      -0.212121
10000    19870228       0.000000
10000    19870331      -0.384615
10000    19870430      -0.062500
10000    19870531      -0.066667
10001    19860228       0.020408
10001    19860331       0.025200
10001    19860430       0.009901
10001    19860531      -0.009804
```

This would obviously create trouble for the approach above. One way to handle this would be to say:

```     data msf;
set crsp.msf(keep=permno date ret);
by permno;
date=intnx('month', date, 1)-1;
lagret=lag(ret);
lagdate=lag(date);
if first.permno then lagret=.;
if not(lagdate=intnx('month', date, 0)-1) then lagret=.;
run;
```

which checks to see if the lagged date is indeed a month previous to the current date. This is inelegant, and difficult to extend, if, for instance, you have several variables to lag.

```     data msf;
set crsp.msf(keep=permno date ret);
date=intnx('month', date, 1)-1;
run;

data msflag(rename=(ret=lagret));
set msf;
date=intnx('month', date+1, 1)-1;
run;

data msf;
merge msf msflag;
by permno date;
run;
```

This creates a new dataset msflag with all variables identical, except that the date is moved forward one month. This means that, in msflag, the return for January 1986 for each permno is going to have the date set to February of 1986. When you merge msflag with msf (by permno and date), the returns line up exactly as you want them to. Explicitly, msflag looks like this:

``` PERMNO        DATE        LAGRET

10000    19860331      -0.257143
10000    19860430       0.365385
10000    19860531      -0.098592
10000    19860630      -0.222656
10000    19860731      -0.005025
10000    19860930      -0.615385
10000    19861031      -0.057143
10000    19861130      -0.242424
10000    19861231       0.060000
10000    19870131      -0.377358
10000    19870228      -0.212121
10000    19870331       0.000000
10000    19870430      -0.384615
10000    19870531      -0.062500
10000    19860228      -0.066667
10001    19860331       0.020408
10001    19860430       0.025200
10001    19860531       0.009901
10001    19860630      -0.009804
```

This handles all the exceptions you can think of. The merged dataset is below:

```   Obs      PERMNO        DATE            RET         lagret

1       10000    19860228      -0.257143        .
2       10000    19860331       0.365385      -0.257143
3       10000    19860430      -0.098592       0.365385
4       10000    19860531      -0.222656      -0.098592
5       10000    19860630      -0.005025      -0.222656
6       10000    19860731        .            -0.005025
7       10000    19860831      -0.615385        .
8       10000    19860930      -0.057143      -0.615385
9       10000    19861031      -0.242424      -0.057143
10       10000    19861130       0.060000      -0.242424
11       10000    19861231      -0.377358       0.060000
12       10000    19870131      -0.212121      -0.377358
13       10000    19870228       0.000000      -0.212121
14       10000    19870331      -0.384615       0.000000
15       10000    19870430      -0.062500      -0.384615
16       10000    19870531      -0.066667      -0.062500
17       10000    19870630        .            -0.066667
18       10001    19860228       0.020408        .
19       10001    19860331       0.025200       0.020408
20       10001    19860430       0.009901       0.025200
```

This will not work with daily data. First, Friday is not one day before Monday, but that can be handled by using weekday as the incrementing unit in intnx. Second, holidays make any systematization impossible.

For daily data, you need another method. Suppose I want to find lag returns in the DSF file, that is, the return on the previous trading day.

The way I do it is to first create a dataset that contains, for every date in DSF, the date of the previous trading day. Then I merge this dataset into DSF by DATE, creating a dataset MAIN, which therefore has a LAGDATE associated with each DATE. Then I re-merge DSF with MAIN, but by PERMNO LAGDATE, so that the values of RET being merged into MAIN are the lagged values.

First create a file with all the dates in the dataset, and create a lagdate:

```  proc sort data=crsp.dsf(keep=date) out=dsfdates nodupkey;
by date;
run;

data dsfdates;
set dsfdates;
lagdate=lag(date);
run;
```

Then merge that into DSF, by date:

```  proc sort data=crsp.dsf(keep=permno date ret) out=dsf;
by date;
run;

data main;
merge dsf dsfdates;
by date;
run;
```

MAIN now has PERMNO DATE RET and LAGDATE. I need to get LAGRET. To do this, I merge DSF into MAIN, but by PERMNO LAGDATE, not PERMNO DATE:

```  data lagdsf(rename=(date=lagdate ret=lagret));
set crsp.dsf(keep=permno date ret);
run;

/*NOTE: DSF is already sorted by permno date*/

data main;
merge main(in=a) lagdsf;
by permno lagdate;
run;
```

This is horribly inefficient, because you have to sort DSF by date, which takes 3 minutes at the best of times, but it serves to illustrate the point.

Next: Rolling regressions without macros Up: Working with time-series data Previous: Moving dates around
Andre de Souza 2012-11-19