Updated Gibbs estimates of effective costs

In my 2009 paper, Trading costs and returns for US equities: estimating effective costs from daily data (Journal of Finance 64(3), 1445-1477; DOI link: http://dx.doi.org/10.1111/j.1540-6261.2009.01469.x), I formed estimates for effective costs for US equities annually, beginning in 1926. This page provides current estimates (through 2009), programs and other materials. The link to the dataset is given at the bottom of this page.

Important: The programs used for these updated computations differ from those used in the original paper. Agreement between the two sets of estimates is good, but not perfect. (There are differences in sample construction and also in random-number algorithms.) While I believe the updated computations to be accurate, I make no guarantees.

#### Model

For a given stock, the log bid-ask midpoint (implicit efficient price) is $m_t$, with dynamics

$$\Delta m_t = \beta\, r_{Mt} + u_t$$

where $t$ indexes days, $r_{Mt}$ is the log return on the CRSP value-weighted market index, and $u_t \sim N(0, \sigma_u^2)$. The log closing price is $p_t$, given by

$$p_t = m_t + c\, q_t$$

where $c$ is the effective cost and $q_t$ is the trade direction indicator. As described in the paper, if there is no trade on a given day, CRSP reports the midpoint, which is accommodated by setting $q_t = 0$. Otherwise, $q_t$ is either $-1$ (a trade at the bid) or $+1$ (a trade at the ask). The prior here is that the bid and ask realizations are equally probable, but conditional on the data the posterior probabilities may be very unequal.

The model parameters are $c$, $\beta$ and $\sigma_u^2$. I compute Bayesian estimates for these using the Gibbs sampler described in the paper. The estimates are posterior means for $c$, $\beta$, $\sigma_u^2$ and $\sigma_u$. (The latter two are reported separately because the posterior mean of $\sigma_u^2$ is not the square of the posterior mean of $\sigma_u$, and there are situations in which one might prefer an unbiased estimate of one or the other.)
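To make the model concrete, the following Python sketch simulates log closing prices from the two equations above and applies a simple Roll-type moment check to the simulated series. It is an illustration only and is not the Gibbs sampler used for the posted estimates. The function names, the market-return volatility `sigma_mkt`, the no-trade probability `p_no_trade` and the i.i.d. normal stand-in for the market return are all assumptions introduced here.

```python
import math
import random


def simulate_log_prices(n_days, c, beta, sigma_u, sigma_mkt=0.01,
                        p_no_trade=0.0, seed=0):
    """Simulate log closing prices: efficient price plus c times trade direction.

    sigma_mkt and p_no_trade are illustrative assumptions, not values
    taken from the paper or the posted programs.
    """
    rng = random.Random(seed)
    m = 0.0  # log efficient price, started at zero
    prices, directions, mkt_returns = [], [], []
    for _ in range(n_days):
        r_m = rng.gauss(0.0, sigma_mkt)  # stand-in for the market return
        u = rng.gauss(0.0, sigma_u)      # idiosyncratic innovation
        m += beta * r_m + u              # efficient-price dynamics
        if rng.random() < p_no_trade:
            q = 0                        # no trade: the midpoint is reported
        else:
            q = rng.choice((-1, 1))      # bid or ask, equally likely a priori
        prices.append(m + c * q)
        directions.append(q)
        mkt_returns.append(r_m)
    return prices, directions, mkt_returns


def roll_moment_estimate(prices):
    """Return sqrt(-autocovariance) of price changes, or 0.0 if it is >= 0."""
    dp = [b - a for a, b in zip(prices, prices[1:])]
    mean = sum(dp) / len(dp)
    acov = sum((x - mean) * (y - mean)
               for x, y in zip(dp[1:], dp)) / (len(dp) - 1)
    return math.sqrt(-acov) if acov < 0 else 0.0
```

When every day has a trade and the trade directions are independent, the first-order autocovariance of the simulated price changes is minus the square of the effective cost, so the moment estimate should be close to the value of `c` passed to the simulator in a long sample. In short samples the sample autocovariance is often positive and the moment estimate is undefined, which is one motivation for a Bayesian approach.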

#### Sample

The sample is based on the CRSP daily stock and index files. Estimates are formed annually, with one exception: there may be multiple estimates in a given year if the stock changed exchange listing or if there was a split. In the data set, the primary identifiers for each sample are therefore CRSP permno, year and kSample (= 1, 2, ...). Supplemental data include the starting and ending dates of the sample, exchange code, cfacpr and shrcd (from CRSP). Estimates are computed only for samples with at least 60 reported trade prices.
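The sample rules above can be sketched in Python as follows. This is a hypothetical illustration and not the logic of the posted SAS driver: the record layout (dicts with keys `date`, `exchcd`, `cfacpr` and `traded`), the use of a change in cfacpr to detect a split, and the choice to let kSample count segments that are later dropped are all assumptions made here. Only the 60-trade-price threshold comes from the text.

```python
from itertools import groupby

MIN_TRADE_PRICES = 60  # threshold stated in the text


def build_samples(days):
    """Split one stock-year of date-ordered daily records into samples.

    A new segment starts whenever exchcd or cfacpr changes (assumed
    split detection). Segments with too few trade prices are dropped.
    """
    samples = []
    k = 0
    segments = groupby(days, key=lambda d: (d["exchcd"], d["cfacpr"]))
    for (exchcd, cfacpr), run in segments:
        run = list(run)
        k += 1  # assumed: kSample counts every segment, kept or not
        n_trades = sum(1 for d in run if d["traded"])
        if n_trades >= MIN_TRADE_PRICES:
            samples.append({
                "kSample": k,
                "exchcd": exchcd,
                "cfacpr": cfacpr,
                "firstDate": run[0]["date"],
                "lastDate": run[-1]["date"],
                "nDays": len(run),
            })
    return samples
```

Because `groupby` only merges consecutive records, a stock that returns to an earlier exchange within the year would produce a separate segment for each spell.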

#### Programs

The programs that produce the estimates are written entirely in SAS and run on WRDS. There are two program files.

• RollGibbsLibrary02.sas sets up a SAS/IML library with IML subroutines that do the estimation. These subroutines are fairly general and can be modified should you wish to experiment with parameter values and priors.
• crspGibbsBuildv01.sas is driver code: this program reads the CRSP files, builds the samples and then calls the IML subroutines to perform the estimation.

These programs differ from the ones that generated the estimates used in the 2009 paper. For the paper, I formed the sample and extracted the CRSP data using SAS programs on WRDS, computed the estimates using Matlab on an NYU Unix system, and then read the estimates back into a SAS dataset. Matlab is computationally more efficient than SAS (maybe by a factor of ten), but it is cleaner to set things up as one pass, one language, one system.

Agreement between the updated estimates and those that correspond to the published paper is very good, but not perfect. (The earlier estimates are contained in liqestimatesaug2006.sas7bdat, on my GibbsEstimates2006 page.) There are differences in sampling methodology and programming languages. Nevertheless, matched by permno and year, correlations between the two sets of estimates are about 0.985. When the match is restricted to samples that cover at least 240 days, the correlations are about 0.998.
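A comparison of this kind can be sketched in Python as below. The function name and the input layout (dictionaries keyed by a `(permno, year)` tuple) are assumptions made for illustration. The sketch does not address how multiple kSample entries within a year were handled in the reported correlations.

```python
import math


def matched_correlation(old, new):
    """Pearson correlation of estimates matched on (permno, year).

    `old` and `new` map (permno, year) to an estimate. Keys present in
    only one of the two inputs are dropped.
    """
    keys = sorted(set(old) & set(new))
    if len(keys) < 2:
        raise ValueError("need at least two matched observations")
    x = [old[k] for k in keys]
    y = [new[k] for k in keys]
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)
```

Restricting the match to longer samples, as in the second correlation quoted above, amounts to filtering the keys on the number of days before calling the function.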

#### Higher frequency estimates?

I often receive inquiries regarding Gibbs estimates formed at higher frequencies (e.g., monthly or weekly). I don't provide these estimates due to concerns about their reliability. The 2009 paper describes some of the issues that arise. Briefly, the prior distributions used here are diffuse (to ensure that the posteriors are data-dominated). The priors are, however, generally biased. As the sample size drops, the posteriors start resembling the priors, and the bias problem becomes more acute. The only way out of this is to put more structure on the priors. This is not impractical, but it is application-specific. The 2009 paper and the teaching note (next topic) provide some suggestions.
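The sample-size effect can be seen in a toy conjugate model, sketched below in Python. This is not the prior structure of the paper: it assumes normally distributed observations with known variance and a normal prior on the mean, in which case the posterior mean is a precision-weighted average of the prior mean and the sample mean. The function names and the day counts (roughly 250, 21 and 5 trading days for annual, monthly and weekly samples) are assumptions chosen for illustration.

```python
def prior_weight(n_obs, data_var, prior_var):
    """Weight placed on the prior mean in a normal-normal posterior mean."""
    prior_precision = 1.0 / prior_var
    data_precision = n_obs / data_var
    return prior_precision / (prior_precision + data_precision)


def posterior_mean(sample_mean, n_obs, data_var, prior_mean, prior_var):
    """Precision-weighted average of the prior mean and the sample mean."""
    w = prior_weight(n_obs, data_var, prior_var)
    return w * prior_mean + (1.0 - w) * sample_mean


# Same prior, shrinking sample: roughly annual, monthly and weekly.
for n in (250, 21, 5):
    print(n, prior_weight(n, data_var=1.0, prior_var=10.0))
```

The weight on the prior rises as the number of observations falls, so any bias in the prior mean passes through to the posterior mean more strongly in short samples.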

#### Supplemental

The methodology is described more fully in a teaching note, Gibbs estimation of microstructure models. There is a pdf version and a Mathematica notebook. The SAS programs used to generate some of the results in the teaching note are more general variants of the ones used for the CRSP data. They are posted in the ftp directory TeachingPrograms.

#### Datasets

The SAS dataset is crspGibbs2009v01.sas7bdat. I've also put the estimates in a plain text file, crspGibbs2009v01.txt.

The contents of both datasets are:

```
 #    Variable      Type    Len    Format    Informat    Label

 1    permno        Num       8    7.        8.          CRSP permno
 2    year          Num       8    4.                    Sample year
 3    kSample       Num       8    2.                    Sample number within year
 4    c             Num       8    7.5                   c estimate
 5    beta          Num       8    8.4                   beta estimate
 6    varu          Num       8    10.8                  Var(u) estimate
 7    sdu           Num       8    10.8                  SD(u) estimate
 8    exchcd        Num       8    2.                    CRSP exchcd for year/kSample
 9    shrcd         Num       8    2.                    CRSP shrcd for year/kSample
10    CFACPR        Num       8    8.6                   CRSP cfacpr for year/kSample
11    firstDate     Num       8    DATE.                 Start date for year/kSample
12    lastDate      Num       8    DATE.                 End date for year/kSample
13    nDays         Num       8    3.                    Number of days in sample