Suppose I have monthly data on stock returns and I want to naively test the hypothesis that smaller stocks have higher returns. From crsp.msf I can construct a dataset with return and lagged size, following the prescription in Section 7.2 to construct lagged sizes:
    data msf(keep=permno date ret)
         msflag(keep=permno date size rename=(size=lagsize));
        set crsp.msf(keep=permno date ret prc shrout);
        /* align the date to the last day of its month */
        date = intnx('month', date, 1) - 1;
        output msf;
        /* stamp this month's size with the end of the following month */
        size = abs(prc) * shrout;
        date = intnx('month', date + 1, 1) - 1;
        output msflag;
    run;

    data msf;
        merge msf msflag;
        by permno date;
    run;
MSF now has
PERMNO DATE RET LAGSIZE
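The date arithmetic in the step above can be checked in isolation. The following is a minimal sketch, assuming an arbitrary mid-month date that I picked purely for illustration; the variable names eom and next_eom are hypothetical and do not appear in the datasets above.

```sas
data _null_;
    date = '15JAN2020'd;                        /* illustrative date, not from crsp.msf */
    eom = intnx('month', date, 1) - 1;          /* first day of next month, minus one day */
    next_eom = intnx('month', eom + 1, 1) - 1;  /* same trick, starting from the day after eom */
    put eom= date9. next_eom= date9.;           /* writes eom=31JAN2020 next_eom=29FEB2020 */
run;
```

The first value is the date attached to the return, and the second is the date attached to the size, which is why the size lines up with the following month's return after the merge.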
Without dummies, proc reg is:
proc reg data=msf; model ret=lagsize; run;
Suppose I want permno fixed effects. Let me describe the brute-force way of doing this. First, I find out how many permnos there are. Then I create that many dummies: variables such that, for each observation, exactly one equals 1 and all the others equal 0. Then I run proc reg with these variables on the right-hand side.
So I must first find out how many permnos there are. To do this, I say:
    proc sort data=msf out=count_of_permnos(keep=permno) nodupkey;
        where ret is not missing and lagsize is not missing;
        by permno;
    run;
COUNT_OF_PERMNOS contains a list of permnos which have data for the regression.
To find out how many there are, we can do:
    data count_of_permnos;
        if _n_=1 then put nobs;
        set count_of_permnos nobs=nobs;
    run;
This might strike you as peculiar: why is the IF statement before the SET statement? Will the NOBS value have been populated when the IF statement is executed? The answer is that NOBS is populated as soon as the data step code has been compiled, before any observations have been read. At the same time, SAS reads the descriptors of the datasets named in SET, MERGE, and UPDATE statements and sets up the variables that the datasets named in the DATA statement are expected to contain.
Now I open the log and see what value nobs took. The log contains
    29   data count_of_permnos;
    30       if _n_=1 then put nobs;
    31       set count_of_permnos nobs=nobs;
    32   run;

    27865
    NOTE: There were 27865 observations read from the data set WORK.COUNT_OF_PERMNOS.
    NOTE: The data set WORK.COUNT_OF_PERMNOS has 27865 observations and 1 variables.
which tells me what I want to know.
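Because NOBS is filled in at compile time, the count can even be obtained without reading a single observation. The following is a sketch of a common idiom, assuming the COUNT_OF_PERMNOS dataset created by the proc sort above; the SET statement sits behind a condition that is never true, so it is compiled but never executed.

```sas
data _null_;
    /* never executed, but compiling it is enough to populate nobs */
    if 0 then set count_of_permnos nobs=nobs;
    put nobs=;   /* writes the observation count to the log */
    stop;        /* end the step explicitly, since no SET ever hits end-of-file */
run;
```

Using data _null_ also avoids rewriting the dataset just to learn its size.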
I then proceed to create my dummies. I will do this using arrays:
    data count_of_permnos;
        set count_of_permnos;
        array permnonames (27865) permnonames1-permnonames27865;
        do i = 1 to dim(permnonames);
            permnonames(i)=0;
        end;
        permnonames(_n_)=1;
    run;
For each observation, this creates 27865 variables named permnonames1 through permnonames27865 and puts them into an array named permnonames, so that we can do something to each of them programmatically. The do-loop sets all members of the array to zero, and the statement after it sets to 1 the dummy corresponding to the current permno (i.e., the _N_th permno: see Subsection 3.2.2 for a description of _N_). This works because COUNT_OF_PERMNOS has exactly one observation per permno. I do not recommend printing this dataset out.
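To see what the array step produces without 27865 columns, here is a toy version with three made-up permnos. Everything in it is hypothetical (the dataset names toy and toy_dummies, the array name d, and the permno values); it is only meant to show the pattern on a dataset small enough to print.

```sas
/* three invented permnos, one observation each */
data toy;
    input permno;
    datalines;
10001
10002
10003
;
run;

data toy_dummies;
    set toy;
    array d (3) d1-d3;
    do i = 1 to dim(d);
        d(i) = 0;          /* start with every dummy at zero */
    end;
    d(_n_) = 1;            /* switch on the dummy for the current observation */
    drop i;                /* the loop counter is not needed in the output */
run;

/* rows: 10001 1 0 0 / 10002 0 1 0 / 10003 0 0 1 */
proc print data=toy_dummies noobs;
run;
```

The drop statement is an addition of mine; without it the loop counter i is carried along as an extra variable, which is harmless but untidy.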
I then merge these dummies back into the original dataset and run my regression:
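    data msf;
        merge count_of_permnos msf;
        by permno;
    run;

    proc reg data=msf;
        model ret=lagsize permnonames1-permnonames27865/noint;
    run;
The noint option excludes the intercept from the model. This is needed because the full set of dummies sums to one for every observation, which makes it perfectly collinear with an intercept. Alternatively, you could keep the intercept and drop one of the dummies: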
proc reg data=msf; model ret=lagsize permnonames2-permnonames27865; run;
If you'd rather not manually look at your log file to see how many dummies there need to be, you can use CALL SYMPUT in the following way:
    data count_of_permnos;
        /* convert nobs to text without leading blanks, so that permnonames&nobs resolves to a valid name */
        if _n_=1 then call symput('nobs', strip(put(nobs, 8.)));
        set count_of_permnos nobs=nobs;
    run;

    data count_of_permnos;
        set count_of_permnos;
        array permnonames (&nobs) permnonames1-permnonames&nobs;
        do i = 1 to dim(permnonames);
            permnonames(i)=0;
        end;
        permnonames(_n_)=1;
    run;
All I have done is to replace the number 27865 with the macro variable &NOBS, which I created in a previous data step. You cannot use the non-macro variable nobs in defining the array because an array definition requires an integer constant as its dimension, not a variable. The macro variable works because it is replaced by its text value before the data step is compiled.
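One thing to watch when passing a number through CALL SYMPUT: a numeric argument is converted to text with the BEST12. format, which right-aligns the value and so leaves leading blanks in the macro variable. Those blanks would break a name such as permnonames&nobs. As a sketch of one way around this, CALL SYMPUTX strips leading and trailing blanks for you, and a %PUT with brackets shows exactly what was stored.

```sas
data _null_;
    /* symputx stores the value with leading and trailing blanks removed */
    if _n_=1 then call symputx('nobs', nobs);
    set count_of_permnos nobs=nobs;
run;

/* the brackets make any stray blanks visible in the log */
%put nobs=[&nobs];
```

If the log shows blanks between the opening bracket and the number, the value was stored with padding and should be cleaned before it is used to build variable names.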
Running this poses several problems. First, you have 27866 independent variables! The matrix to invert is enormous, so this is massively inefficient. Second, when you run proc reg like this, it prints the values of those 27865 dummy coefficients, which you don't care about at all. See Section 10.3.5 for an example of how to avoid this.
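For comparison, here is a sketch of one commonly used way to get the same slope on lagsize without building any dummies: the ABSORB statement of PROC GLM. This is a swapped-in technique of mine, not necessarily the approach the later section takes, and it assumes that SAS/STAT is available and that MSF is still sorted by permno, as it is after the merge above.

```sas
/* sketch only: msf must be sorted by the absorbed variable */
proc glm data=msf;
    absorb permno;                    /* remove permno effects without creating dummies */
    model ret = lagsize / solution;   /* solution prints the parameter estimates */
run;
quit;
```

The fixed effects themselves are not reported, which in this setting is a feature rather than a loss.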