(Alternatively, suppose I am merging two datasets using a fuzzy merge variable. After the merge is done, I want to see which observations in one dataset were matched to a single observation in the other, and which were matched to two or more.)
Suppose I have a mutual fund holdings dataset that only contains, as stock information, the ticker of the stock. That is, it contains
CRSP_FUNDNO DATE STOCKTICKER SHARES_HELDI need to get the permno for these stocks. To do this, I first create a permno-date-ticker file as described in Section 6.2.
data stocknames(keep=permno date ticker); set crsp.stocknames(keep=namedt nameenddt permno ticker); format date date9.; where namedt is not missing; date=intnx('month', namedt, 1)-1; do while(date<=nameenddt); output; date=intnx('month', date+1, 1)-1; end; run;
The first thing to do is to check that this dataset has, at each date, only
To check that each ticker has only one permno, I do the following:
proc sort data=stocknames; by date ticker permno; run; data doubles singles; set stocknames; by date ticker permno; if not (first.ticker and last.ticker) then output doubles; else output singles; run; proc print data=doubles(obs=20); run;
The data step could do with some explaining. Since stocknames is sorted by DATE TICKER PERMNO, if each ticker linked to only one permno on each date, it should be the case that each date-ticker combination has only one entry in the dataset. If it does not, then that date-ticker combination links to more than one permno.
This produces output like this:
Obs PERMNO ticker date 1 29938 BFD 31JUL1962 2 29946 BFD 31JUL1962 3 27414 BO 31JUL1962 4 27561 BO 31JUL1962 5 29866 BQ 31JUL1962 6 29874 BQ 31JUL1962 7 30891 CDL 31JUL1962 8 30904 CDL 31JUL1962 9 31093 CLI 31JUL1962 10 31106 CLI 31JUL1962 11 32854 HG 31JUL1962 12 32862 HG 31JUL1962 13 32934 HUB 31JUL1962 14 32942 HUB 31JUL1962 15 35553 PDL 31JUL1962 16 35561 PDL 31JUL1962 17 35115 PKR 31JUL1962 18 35123 PKR 31JUL1962 19 22517 PPL 31JUL1962 20 35051 PPL 31JUL1962where, as you can see, the doubles dataset contains the desired elements. I then use the singles dataset, which has only one permno for each ticker, in further tests, notably, seeing whether each permno matches to only one ticker.
This process is easily automated in a macro:
%macro separatedups(inds, byvar, dupbyvar); proc sort data=&inds; by &byvar; run; data zzsingles zzdoubles; set &inds; by &byvar; if not(first.&dupbyvar and last.&dupbyvar) then output zzdoubles; else output zzsingles; run; %let doubsnobs=0; data _null_; set zzdoubles nobs=nobs; if _n_=1 then call symput("doubsnobs", trim(left(nobs))); run; %if "&doubsnobs"="0" %then %do; %end; %else %do; %put WARNING: &doubsnobs observations in doubles; %end; %mend separatedups;And called with:
%separatedups(stocknames, date ticker permno, ticker);
I can then check both ways by saying:
%separatedups(stocknames, date ticker permno, ticker); data stocknames; set zzsingles; run; %separatedups(stocknames, date permno ticker, permno); data stocknames; set zzsingles; run;
In the first call to %separatedups, there are multiple permnos for each ticker, so zzdoubles is not empty and the following message is written to the log:
WARNING: 35222 observations in doubles
In the second call, zzdoubles is empty: each permno matches to only one ticker.