next up previous
Next: Sundry macros Up: Writing frequently-used code as Previous: Hash-merging without having to

Checking for duplicate by-variable combinations

Before a merge, I frequently want to check that there is only one instance of every by-variable combination in one or both of my datasets. If I am doing what I think is a one-to-one merge, then obviously each dataset should contain only one instance of every by-variable combination. If I am doing a many-to-one merge, then one dataset should have only one instance of every by-variable combination.

(Alternatively, suppose I am merging two datasets using a fuzzy merge variable. After the merge is done, I want to see which observations in one dataset were matched to a single observation in the other, and which were matched to two or more.)

Suppose I have a mutual fund holdings dataset that only contains, as stock information, the ticker of the stock. That is, it contains

  CRSP_FUNDNO DATE STOCKTICKER SHARES_HELD
I need to get the permno for these stocks. To do this, I first create a permno-date-ticker file as described in Section 6.2.
  data stocknames(keep=permno date ticker);
        set crsp.stocknames(keep=namedt nameenddt permno ticker);
	format date date9.;
	where namedt is not missing;
	date=intnx('month', namedt, 1)-1; 
	do while(date<=nameenddt);
	   output;
	   date=intnx('month', date+1, 1)-1;
	   end;
     run;

The first thing to do is to check that this dataset has, at each date, only

  1. One ticker for each permno and
  2. One permno for each ticker.

To check that each ticker has only one permno, I do the following:

proc sort data=stocknames;
    by date ticker permno;
run;

data doubles singles;
    set stocknames;
    by date ticker permno;
    if not (first.ticker and last.ticker) then output doubles;
    else output singles;
run;

proc print data=doubles(obs=20);
run;

The data step could do with some explaining. Since stocknames is sorted by DATE TICKER PERMNO, if each ticker linked to only one permno on each date, it should be the case that each date-ticker combination has only one entry in the dataset. If it does not, then that date-ticker combination links to more than one permno.

This produces output like this:

                        Obs    PERMNO    ticker         date

                        1     29938     BFD      31JUL1962
                        2     29946     BFD      31JUL1962
                        3     27414     BO       31JUL1962
                        4     27561     BO       31JUL1962
                        5     29866     BQ       31JUL1962
                        6     29874     BQ       31JUL1962
                        7     30891     CDL      31JUL1962
                        8     30904     CDL      31JUL1962
                        9     31093     CLI      31JUL1962
                       10     31106     CLI      31JUL1962
                       11     32854     HG       31JUL1962
                       12     32862     HG       31JUL1962
                       13     32934     HUB      31JUL1962
                       14     32942     HUB      31JUL1962
                       15     35553     PDL      31JUL1962
                       16     35561     PDL      31JUL1962
                       17     35115     PKR      31JUL1962
                       18     35123     PKR      31JUL1962
                       19     22517     PPL      31JUL1962
                       20     35051     PPL      31JUL1962
where, as you can see, the doubles dataset contains the desired elements. I then use the singles dataset, which has only one permno for each ticker, in further tests, notably, seeing whether each permno matches to only one ticker.

This process is easily automated in a macro:

%macro separatedups(inds, byvar, dupbyvar);

    proc sort data=&inds;
        by &byvar;
    run;

    data zzsingles zzdoubles;
        set &inds;
        by &byvar;
        if not(first.&dupbyvar and last.&dupbyvar) then output zzdoubles;
        else output zzsingles;
    run;

    %let doubsnobs=0;
    
    data _null_;
        set zzdoubles nobs=nobs;
        if _n_=1 then call symput("doubsnobs", trim(left(nobs)));
    run;

    %if "&doubsnobs"="0" %then %do;    
        %end;
    %else %do;
        %put WARNING: &doubsnobs observations in doubles;
        %end;
    %mend separatedups;
And called with:

  %separatedups(stocknames, date ticker permno, ticker);

I can then check both ways by saying:

  %separatedups(stocknames, date ticker permno, ticker);

  data stocknames;
     set zzsingles;
  run;

  %separatedups(stocknames, date permno ticker, permno);

  data stocknames;
     set zzsingles;
  run;

In the first call to %separatedups, there are multiple permnos for each ticker, so zzdoubles is not empty and the following message is written to the log:

WARNING: 35222 observations in doubles

In the second call, zzdoubles is empty: each permno matches to only one ticker.


next up previous
Next: Sundry macros Up: Writing frequently-used code as Previous: Hash-merging without having to
Andre de Souza 2012-11-19