Next: One-to-many with the DOW
Up: Merges
Previous: Merging a dataset with
Merging without sorting: the DATA step HASH object
I often have to merge a small dataset into a much larger dataset. If the larger dataset is not appropriately sorted, this is a process that consumes time and resources. An option is to use the DATA step HASH object.
The HASH object allows you to load a ``small'' dataset into RAM as a hash. A hash, or associative array, is a list indexed by a key. For instance, for the MSF dataset, the key variables are permno and date - which means that for a particular permno-date combination, the MSF dataset contains one observation, with the data being return, price, shares outstanding and so on. Given a permno-date combination, I could find the data for that combination by opening the MSF dataset and going through it until I hit the appropriate value. What a hash does is to allow you to instantly access the data for any specified permno-date combination without going through the entire dataset.
How do you create and use the hash? An example is probably the best way to explain.
In this example, I want to merge the monthly-frequency Fama-French factors with the MSF dataset, so I can run regressions to calculate betas.
data msf;
set crsp.msf(keep=permno date ret);
where ret is not missing;
date=intnx('month', date, 1)-1;
run;
data ff;
set ff.factors_monthly(keep=date mktrf smb hml umd rf);
date=intnx('month', date, 1)-1;
run;
All standard so far. Observe that I make the dates in both datasets end-of-month.
Now I define and use the hash.
data msf;
if _n_=1 then do;
declare hash h(dataset:'ff');
h.defineKey('date');
h.defineData('mktrf', 'smb', 'hml', 'umd', 'rf');
call missing(mktrf, smb, hml, umd, rf);
h.defineDone();
end;
set msf;
if h.find() = 0 then output;
run;
What this does is:
- On the first observation (_N_=1), before any data is read from the SET MSF statement, a hash named H is defined. This hash consists of data from the dataset FF. The hash key, which will be used to perform lookups, is defined to be DATE. The hash data is defined to be MKTRF, SMB, HML, UMD, RF. The CALL MISSING statement sets the value of these variables in the output dataset to missing initially. If you omit this, SAS will complain that the variable are not defined.
- Then the first observation is read from the SET MSF statement, loading values of DATE, PERMNO, and RET. Suppose these values are 10000, 19860228, and -0.257143. The H.FIND() statement immediately below the SET statement looks at the values of the KEY variables currently being processed. In this case, the KEY variable is DATE, which currently has the value 19860228. The H.FIND() statement searches the hash for a key value of 19860228. If it finds it, it populates the variables MKTRF, SMB, HML, UMD, RF with the data corresponding to that KEY in the hash, and returns a value of zero. If it does not find it, the variables MKTRF, SMB, HML, UMD, RF continue to have the missing value, and the H.FIND() returns a nonzero value
- If H.FIND() returns a zero value, indicating that it found that date in the dataset FF, the observation is output to the output dataset. Otherwise the observation is not output. If you wanted to keep both, observations for which a corresponding observation in FF exists, and those for which a corresponding observation does not exist, you could say:
data msf;
if _n_=1 then do;
declare hash h(dataset:'ff');
h.defineKey('date');
h.defineData('mktrf', 'smb', 'hml', 'umd', 'rf');
call missing(mktrf, smb, hml, umd, rf);
h.defineDone();
end;
set msf;
if h.find() = 0 then foundinFF=1;
else foundinFF=0;
run;
The foundinFF variable will tell you whether the observation existed in the FF dataset or not.
Some caveats:
- The hash is only useful when you are subsetting a larger dataset. If you wanted to keep observations in FF that did not exist in MSF, this form of the hash would not help.
- The dataset that is read into the hash has to be below a certain size, otherwise it will not fit into your memory. In my situation, MSF just about fits into memory. SAS will die with an out-of-memory error if you try to load a dataset that is larger than available memory.
The hash is of enormous utility in performing sortless merges, but it is also very useful in other situations. An example of this is many-to-many merges, explored in Section 12.4.2.
Because it is a headache to type out all this when you want to do merges with hashes, I wrote a macro that does it for me. See section 10.3.2 for the macro.
A nice introduction to the hash object can be found in Secosky and Bloom (2006).
Next: One-to-many with the DOW
Up: Merges
Previous: Merging a dataset with
Andre de Souza
2012-11-19