In my 2009 paper, Trading costs and returns for US equities: estimating effective costs from daily data (Journal of Finance 64(3), 1445-1477; DOI link: http://dx.doi.org/10.1111/j.1540-6261.2009.01469.x), I formed estimates for effective costs for US equities annually, beginning in 1926. This page provides current estimates (through 2009), programs and other materials. The link to the dataset is given at the bottom of this page.
Important: The programs used for these updated computations differ from those used in the original paper. Agreement between the two sets of estimates is good, but not perfect. (There are differences in sample construction and also in random-number algorithms.) While I believe the updated computations to be accurate, I make no guarantees.
For a given stock, the log bid-ask midpoint (implicit efficient price) is $m_t$, with dynamics
$$\Delta m_t = \beta r_{Mt} + u_t$$
where $t$ indexes days, $r_{Mt}$ is the log return on the CRSP value-weighted market index, and $u_t \sim N(0, \sigma_u^2)$. The log closing price is $p_t$, given by
$$p_t = m_t + c q_t$$
where $c$ is the effective cost and $q_t$ is the trade direction indicator. As described in the paper, if there is no trade on a given day, CRSP reports the midpoint, which is accommodated by setting $q_t = 0$. Otherwise, $q_t$ is either $-1$ (a trade at the bid) or $+1$ (a trade at the ask). The prior here is that the bid and ask realizations are equally probable, but conditional on the data the posterior probabilities may be very unequal.
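To see how the data can push the bid/ask probabilities away from one half, here is a minimal Python sketch of the Bayes-rule calculation implied by the two equations above. It is an illustration only, not the SAS implementation. It treats the neighboring midpoints and the parameters as known, and the function name, argument names (beta and sigma_u stand for the market-return coefficient and the innovation standard deviation) and input values are all made up for the example.

```python
import math

def posterior_q(p_t, m_prev, m_next, r_m_t, r_m_next, c, beta, sigma_u):
    """Posterior P(q_t = -1) and P(q_t = +1) on a trade day.

    Starts from the equal-probability prior (which cancels) and weights each
    candidate direction by the normal density of the two innovations it implies.
    """
    log_w = {}
    for q in (-1, 1):
        m_t = p_t - c * q                          # midpoint implied by q_t = q
        u_t = m_t - m_prev - beta * r_m_t          # innovation from day t-1 to t
        u_next = m_next - m_t - beta * r_m_next    # innovation from day t to t+1
        log_w[q] = -(u_t ** 2 + u_next ** 2) / (2.0 * sigma_u ** 2)
    top = max(log_w.values())                      # subtract the max for stability
    w = {q: math.exp(v - top) for q, v in log_w.items()}
    total = sum(w.values())
    return {q: w[q] / total for q in w}

# Hypothetical inputs: flat market, neighboring midpoints at 0, c = 0.010.
# A closing price one half-spread above the neighbors points to the ask.
lopsided = posterior_q(0.010, 0.0, 0.0, 0.0, 0.0, c=0.010, beta=1.0, sigma_u=0.005)
# A closing price level with the neighbors is uninformative about direction.
balanced = posterior_q(0.000, 0.0, 0.0, 0.0, 0.0, c=0.010, beta=1.0, sigma_u=0.005)
print(lopsided)
print(balanced)
```

In the first case nearly all of the posterior mass falls on the ask, and in the second the two directions remain equally likely.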
The model parameters are $c$ (the effective cost), $\beta$ (the coefficient on the market return) and $\sigma_u^2$ (the innovation variance). I compute Bayesian estimates for these using the Gibbs sampler described in the paper. The estimates are posterior means for $c$, $\beta$, $\sigma_u^2$ and $\sigma_u$. (The latter two are reported separately because the posterior mean of $\sigma_u^2$ is not the square of the posterior mean of $\sigma_u$, and there are situations in which one might prefer an unbiased estimate of one or the other.)
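The reason for reporting both a variance and a standard deviation can be checked numerically. The sketch below uses arbitrary gamma draws as a stand-in for posterior draws of the innovation variance (they are not output from the sampler) and compares the mean of the variance draws with the square of the mean of their square roots.

```python
import math
import random

random.seed(0)

# Hypothetical stand-in for posterior draws of the innovation variance.
draws_var = [random.gammavariate(2.0, 0.0001) for _ in range(5000)]
draws_sd = [math.sqrt(v) for v in draws_var]

mean_var = sum(draws_var) / len(draws_var)   # analogue of the Var(u) estimate
mean_sd = sum(draws_sd) / len(draws_sd)      # analogue of the SD(u) estimate

print("mean of variance draws:   ", mean_var)
print("square of mean of sd draws:", mean_sd ** 2)
```

The second number is always the smaller of the two (Jensen's inequality), so squaring the reported standard deviation does not recover the reported variance.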
The sample is based on the CRSP daily stock and index files. Estimates are formed annually, except that there may be multiple estimates in a given year if the stock has changed exchange listing or if there has been a split. So in the dataset, the primary identifiers for each sample are CRSP permno, year and kSample (=1, 2 ...). Supplemental data include starting and ending dates of the sample, exchange code, cfacpr and shrcd (from CRSP). Estimates are computed only for samples with at least 60 reported trade prices.
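The sample construction just described might be sketched as follows. This is a guess at the logic, not a translation of the SAS programs. It assumes that a change in exchcd or cfacpr marks a listing change or a split, that kSample simply counts the resulting runs within the year, and that each daily record carries a hypothetical `traded` flag.

```python
from itertools import groupby

MIN_TRADE_PRICES = 60  # threshold stated in the text

def split_samples(days):
    """Split one stock-year of date-ordered daily records into samples.

    Each record is a dict with hypothetical keys 'exchcd', 'cfacpr' and
    'traded' (True when a trade price was reported). A new sample starts
    whenever exchcd or cfacpr changes from one day to the next.
    """
    samples = []
    for _, run in groupby(days, key=lambda d: (d["exchcd"], d["cfacpr"])):
        run = list(run)
        n_trade = sum(1 for d in run if d["traded"])
        samples.append({
            "kSample": len(samples) + 1,
            "nDays": len(run),
            "nTradeDays": n_trade,
            "estimated": n_trade >= MIN_TRADE_PRICES,  # too few trades: no estimate
        })
    return samples

# Made-up year: 150 fully traded days, then a split, then 100 thinly traded days.
before = [{"exchcd": 1, "cfacpr": 2.0, "traded": True} for _ in range(150)]
after = [{"exchcd": 1, "cfacpr": 1.0, "traded": i < 40} for i in range(100)]
samples = split_samples(before + after)
for s in samples:
    print(s)
```

Whether the actual programs number a sample that falls below the 60-price threshold is not something this sketch can settle; here it is numbered but flagged as not estimated.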
The programs that produce the estimates are written entirely in SAS and run on WRDS. There are two program files.
These programs differ from the ones that generated the estimates used in the 2009 paper. For the paper, I formed the sample and extracted the CRSP data using SAS programs on WRDS, computed the estimates using Matlab on an NYU Unix system, and then read the estimates back into a SAS dataset. Matlab is computationally more efficient than SAS (maybe by a factor of ten), but it is cleaner to set things up as one pass, one language, one system.
Agreement between the updated estimates and those that correspond to the published paper is very good, but not perfect. (The earlier estimates are contained in liqestimatesaug2006.sas7bdat, on my GibbsEstimates2006 page.) There are differences in sampling methodology and programming languages. Nevertheless, matched by permno and year, correlations between the two sets of estimates are about 0.985. When the match is restricted to samples that cover at least 240 days, the correlations are about 0.998.
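A comparison of this kind can be sketched in a few lines of Python. The dictionaries, values and function names below are invented for illustration, and the sketch assumes one estimate per permno and year in each set. How multiple kSample values within a year should be matched is left open.

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def matched_correlation(old, new, min_days=0):
    """Correlate c estimates matched on (permno, year).

    old: {(permno, year): c}; new: {(permno, year): (c, nDays)}.
    Only keys present in both sets, with nDays >= min_days, are used.
    """
    keys = sorted(k for k in old if k in new and new[k][1] >= min_days)
    r = pearson([old[k] for k in keys], [new[k][0] for k in keys])
    return r, len(keys)

# Invented estimates, for illustration only.
old = {(1, 2000): 0.010, (2, 2000): 0.020, (3, 2000): 0.030, (4, 2000): 0.050}
new = {(1, 2000): (0.011, 250), (2, 2000): (0.019, 250),
       (3, 2000): (0.031, 100), (4, 2000): (0.050, 250),
       (5, 2000): (0.020, 250)}

r_all, n_all = matched_correlation(old, new)
r_long, n_long = matched_correlation(old, new, min_days=240)
print(n_all, r_all)
print(n_long, r_long)
```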
I often receive inquiries regarding Gibbs estimates formed at higher frequencies (e.g., monthly or weekly). I don't provide these estimates due to concerns about their reliability. The 2009 paper describes some of the issues that arise. Briefly, the prior distributions used here are diffuse (to ensure that the posteriors are data-dominated). The priors are, however, generally biased. As the sample size drops, the posteriors start resembling the priors, and the bias problem becomes more acute. The only way out of this is to put more structure on the priors. This is not impractical, but it is application-specific. The 2009 paper and the teaching note (next topic) provide some suggestions.
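The small-sample problem can be seen in a toy calculation that is much simpler than the model used here. Suppose, purely for illustration, that a parameter is restricted to be positive and given a diffuse half-normal prior, and that the data are n noisy observations whose sample mean happens to equal the true value of zero. All names and numbers below are assumptions made for the example.

```python
import math

def posterior_mean_positive(xbar, n, noise_sd, prior_sd, grid=20000, upper=1.0):
    """Posterior mean of a parameter c >= 0, computed on a midpoint grid.

    Prior: half-normal with scale prior_sd (truncated at `upper` for the grid).
    Likelihood: the sample mean xbar of n observations is N(c, noise_sd^2 / n).
    """
    step = upper / grid
    num = den = 0.0
    for i in range(grid):
        c = (i + 0.5) * step
        log_post = (-n * (xbar - c) ** 2 / (2.0 * noise_sd ** 2)
                    - c ** 2 / (2.0 * prior_sd ** 2))
        w = math.exp(log_post)
        num += c * w
        den += w
    return num / den

# True value is zero and the sample mean is exactly zero in both cases,
# so any positive posterior mean reflects the prior, not the data.
m_small = posterior_mean_positive(xbar=0.0, n=12, noise_sd=0.1, prior_sd=1.0)
m_large = posterior_mean_positive(xbar=0.0, n=250, noise_sd=0.1, prior_sd=1.0)
print("n = 12: ", m_small)
print("n = 250:", m_large)
```

With the smaller sample the posterior mean sits several times further from zero, even though the prior is the same in both cases.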
The methodology is described more fully in a teaching note, "Gibbs estimation of microstructure models". There is a pdf version and a Mathematica notebook. The SAS programs used to generate some of the results in the teaching note are more general variants of the ones used for the CRSP data. They are posted in the ftp directory TeachingPrograms.
The SAS dataset is crspGibbs2009v01.sas7bdat. I've also put the estimates in a plain text file, crspGibbs2009v01.txt.
The contents of both datasets are:
 #  Variable    Type  Len  Format  Informat  Label
 1  permno      Num   8    7.      8.        CRSP permno
 2  year        Num   8    4.                Sample year
 3  kSample     Num   8    2.                Sample number within year
 4  c           Num   8    7.5               c estimate
 5  beta        Num   8    8.4               beta estimate
 6  varu        Num   8    10.8              Var(u) estimate
 7  sdu         Num   8    10.8              SD(u) estimate
 8  exchcd      Num   8    2.                CRSP exchcd for year/kSample
 9  shrcd       Num   8    2.                CRSP shrcd for year/kSample
10  CFACPR      Num   8    8.6               CRSP cfacpr for year/kSample
11  firstDate   Num   8    DATE.             Start date for year/kSample
12  lastDate    Num   8    DATE.             End date for year/kSample
13  nDays       Num   8    3.                Number of days in sample
14  nTradeDays  Num   8    3.                Number of days with realized trade
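For readers working outside SAS, the text file might be read along the following lines. The layout of crspGibbs2009v01.txt is not documented on this page, so the tab delimiter and the header row of variable names are assumptions to check against the actual file. The sample rows are invented and carry only a few of the fourteen variables.

```python
import csv
import io

def load_estimates(handle, delimiter="\t"):
    """Index estimates by (permno, year, kSample).

    Assumes a delimited file whose first row holds the variable names
    listed in the table of contents; adjust `delimiter` as needed.
    """
    table = {}
    for row in csv.DictReader(handle, delimiter=delimiter):
        key = (int(row["permno"]), int(row["year"]), int(row["kSample"]))
        table[key] = {
            "c": float(row["c"]),
            "beta": float(row["beta"]),
            "nTradeDays": int(row["nTradeDays"]),
        }
    return table

# Invented rows standing in for the real file.
sample_text = (
    "permno\tyear\tkSample\tc\tbeta\tnTradeDays\n"
    "10001\t2009\t1\t0.00410\t0.8500\t250\n"
    "10002\t2009\t1\t0.01250\t1.1000\t120\n"
    "10002\t2009\t2\t0.00980\t1.0500\t95\n"
)
estimates = load_estimates(io.StringIO(sample_text))
print(estimates[(10002, 2009, 2)])
```

With the real file, `open("crspGibbs2009v01.txt", newline="")` would replace the `io.StringIO` object.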