
Panel Data Econometrics
Panel Data Sets
Professor W. Greene
Department of Economics
Office: MEC 7-78, Ph. 998-0876, Fax. 995-4218
E-mail: wgreene@stern.nyu.edu
Home Page: http://stern.nyu.edu/~wgreene
Return to course home page.
Notes: The following list points to a series of data sets.  We will use
some of these in our class discussions.  A number of others are provided
for students to analyze as part of their study of the topic.  Note, there
are two major cross country data bases online that provide a wealth of interesting
data.  These can be accessed directly:  (The sites are accessed below
just by clicking the names.)
The Penn World Tables
Barro's Cross Country Data
There are many other sources of data on the
web.  One that is particularly rich is the archive of the Journal of
Applied Econometrics.
Data below are provided in three formats: (1)
The 'Text format' is a plain ASCII text file containing the variable
names at the top of the file, followed by the values of the variables, arranged
neatly in the file. (2) The .XLS format is the closest thing we have right now
to a generic file format.  Most econometric programs can import an Excel
spreadsheet file.  If yours cannot, you can use Excel to write the data in
another format or, perhaps, use the ASCII text file. (3) If you are using
LIMDEP or NLOGIT, the project file can be imported directly into the program,
as is.
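As a minimal sketch of reading the text format, the following Python assumes the file is whitespace-delimited with a single header line of variable names and purely numeric values. The delimiter, the function name read_text_panel, and the sample values are illustrative assumptions, not taken from the actual files.

```python
import io

def read_text_panel(handle):
    """Read a whitespace-delimited text file whose first line holds variable names.

    Returns a list of dicts, one per observation, with values parsed as floats.
    Assumes every column is numeric (an assumption; some files may hold names).
    """
    names = handle.readline().split()
    rows = []
    for line in handle:
        fields = line.split()
        if not fields:  # skip blank lines
            continue
        if len(fields) != len(names):
            raise ValueError(f"expected {len(names)} fields, got {len(fields)}")
        rows.append({name: float(value) for name, value in zip(names, fields)})
    return rows

# Made-up values laid out like the Grunfeld variables (not real data).
sample = io.StringIO(
    "FIRM YEAR I F C\n"
    "1 1935 10.0 100.0 5.0\n"
    "1 1936 12.0 110.0 6.0\n"
)
panel = read_text_panel(sample)
print(len(panel), sorted(panel[0]))
```

For a real file, the StringIO object would be replaced by an opened file handle.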
 - Grunfeld Investment Data, 10 Firms, 20
     Years (1935-1954)  
     Variables in the file are
     Firm = Firm ID, 1,...,10
     Year = 1935,...,1954
     I = Investment
     F = Real Value of the Firm
     C = Real Value of the Firm's Capital Stock
     Data are from the Ph.D. dissertation of Y. Grunfeld (Univ. of Chicago,
     1958). See, e.g., Zellner, A., "An Efficient Method of Estimating
     Seemingly Unrelated Regression Equations and Tests for Aggregation
     Bias," Journal of the American Statistical Association, 57,
     1962, pp. 348-368 for analyses of these data. 
 
 - Spanish Dairy Farm Production, N = 247, T = 6
     Variables in the file are
     FARM = Farm ID
     YEAR = year, 93, 94, ..., 98
     Inputs
             COWS,  X1 = log of COWS, in deviations from the mean of the logs
             LAND,  X2 = same
             LABOR, X3 = same
             FEED,  X4 = same
             Translog terms = squares and cross products:
             X11, X22, X33, X44, X12, X13, X14, X23, X24, X34
             YEAR93,...,YEAR98 = year dummy variables
     Output
            MILK = farm output
            YIT = log of MILK production 
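
The construction of the input variables can be sketched as follows. This is an illustration under stated assumptions, not the code used to build the file: it takes deviations from the sample mean of the logs, and it forms the squares and cross products as plain products. Whether the file applies any scaling (such as a factor of 1/2 on the squares) is not stated here. The function name translog_terms and the input values are made up.

```python
import math
from itertools import combinations_with_replacement

def translog_terms(inputs):
    """Build X1..Xk and the second-order terms Xab from input levels.

    inputs: dict mapping an input name to a list of positive levels,
    one per observation. The dict order fixes the numbering X1, X2, ...
    """
    # Step 1: logs, expressed as deviations from the mean of the logs.
    x = {}
    for j, levels in enumerate(inputs.values(), start=1):
        logs = [math.log(v) for v in levels]
        mean_log = sum(logs) / len(logs)
        x[f"X{j}"] = [v - mean_log for v in logs]

    # Step 2: squares (a == b) and cross products (a < b) as plain products.
    # Assumption: no 1/2 scaling is applied to the squared terms.
    terms = dict(x)
    k = len(x)
    for a, b in combinations_with_replacement(range(1, k + 1), 2):
        terms[f"X{a}{b}"] = [p * q for p, q in zip(x[f"X{a}"], x[f"X{b}"])]
    return terms

# Made-up input levels for three observations (not real farm data).
terms = translog_terms({
    "COWS": [10.0, 20.0, 40.0],
    "LAND": [5.0, 8.0, 12.0],
    "LABOR": [1.0, 2.0, 2.0],
    "FEED": [100.0, 150.0, 400.0],
})
print(sorted(terms))
```

With four inputs this yields the four first-order terms plus the ten second-order terms named in the list above.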
 
 
 
 - Bank Cost Data, 500 Banks, 5 Years:
       
     Variables in the file are 
Cit     = total cost of transformation of financial and physical
resources into loans and 
             investments = the sum of
the five cost items described below;
Y1it    = installment loans to individuals for personal and
household expenses;
Y2it    = real estate loans;
Y3it    = business loans;
Y4it    = federal funds sold and securities purchased under
agreements to resell;
Y5it    = other assets;
W1it    = price of labor, average wage per employee;
W2it    = price of capital = expenses on premises and fixed
assets divided by the dollar value of
       premises and fixed assets;
W3it    = price of purchased funds = interest expense on money
market deposits plus expense of 
       federal funds purchased and securities sold
under agreements to repurchase plus interest
       expense on demand notes issued by the U.S.
Treasury divided by the dollar value of
       purchased funds;
W4it    = price of interest-bearing deposits in total
transaction accounts = interest expense on 
       interest-bearing categories of total
transaction accounts;
W5it    = price of interest-bearing deposits in total nontransaction
accounts = interest expense on
       total deposits minus interest expense on money
market deposit accounts divided by the
       dollar value of interest-bearing deposits in
total nontransaction accounts;
T    = trend variable, t = 1,2,3,4,5 for years 1996, 1997, 1998,
1999, 2000
The data in the file are for a translog cost function, linearly homogeneous in
the input prices.  Specifically,
C = log(Cost/W5), W1,W2,W3,W4 = log(Wj/W5), Q1,...,Q5 = log(Ym), and the
squared and cross product
terms are W11, W12,..., Q11,Q12,..., W1Q1,...,W4Q5, T, T2, TW1,...,TW4,
TQ1,...,TQ5.
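
The normalization by W5 is what imposes linear homogeneity in the input prices: if cost and all five prices are multiplied by the same factor, the transformed variables do not change. The following sketch illustrates this for a single observation. The function name normalize_cost_row and all numbers are hypothetical; it covers only the first-order variables, not the squared and cross product terms.

```python
import math

def normalize_cost_row(cost, prices, outputs):
    """Return C, W1..W4 and Q1..Q5 for one bank-year.

    cost: total cost (level); prices: [W1, ..., W5] in levels;
    outputs: [Y1, ..., Y5] in levels.
    """
    w5 = prices[4]
    row = {"C": math.log(cost / w5)}
    for j in range(4):
        row[f"W{j + 1}"] = math.log(prices[j] / w5)
    for m in range(5):
        row[f"Q{m + 1}"] = math.log(outputs[m])
    return row

# Made-up levels for one observation (not real bank data).
cost = 500.0
prices = [30.0, 0.2, 0.05, 0.03, 0.04]
outputs = [100.0, 250.0, 80.0, 40.0, 60.0]

base = normalize_cost_row(cost, prices, outputs)

# Scale cost and every price by the same factor; outputs are unchanged.
factor = 3.0
scaled = normalize_cost_row(cost * factor, [p * factor for p in prices], outputs)

unchanged = all(math.isclose(base[key], scaled[key]) for key in base)
print(unchanged)
```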
 - Dahlberg and Johansson Municipal
     Expenditure Data, 265 Swedish Municipalities, 9 years
     Variables in the file are
     ID = Identification, 1,..., 265
     YEAR = year, 1979,...,1987
     EXPEND = Expenditures
     REVENUE = Receipts, taxes and fees
     GRANTS = Government grants and shared tax revenues
     See Greene (2003, p. 551 and elsewhere) for analysis of these data. The
     article on which the analysis is based is Dahlberg, M. and Johansson,
     E., "An Examination of the Dynamic Behavior of Local Governments
     using GMM Bootstrapping Methods," Journal of Applied Econometrics,
     15, 2000, pp. 401-416.  (These data were downloaded from the JAE data
     archive.) 
 
 - World Gasoline Demand Data, 18 OECD
     Countries, 19 years
     Variables in the file are
     COUNTRY = name of country (Does not appear in the LIMDEP project file)
     YEAR = year, 1960-1978
     LGASPCAR = log of consumption per car
     LINCOMEP = log of per capita income
     LRPMG = log of real price of gasoline 
     LCARPCAP = log of per capita number of cars
     See Baltagi (2001, p. 24) for analysis of these data. The article on which
     the analysis is based is Baltagi, B. and Griffin, J., "Gasoline Demand in
     the OECD: An Application of Pooling and Testing Procedures," European
     Economic Review, 22, 1983, pp. 117-137.  The data were downloaded from the
     website for Baltagi's text.
 - Statewide Capital Productivity Data,
     lower 48 states, 17 years
     Variables in the file are
     STATE = state name
     ST_ABB = state abbreviation (not in project file)
     YR = year, 1970,...,1986
     P_CAP = public capital
     HWY = highway capital
     WATER = water utility capital
     UTIL = utility capital
     PC = private capital
     GSP = gross state product
     EMP = employment
     UNEMP = unemployment rate
     See Baltagi (2001, p. 25) for analysis of these data. The article on which
     the analysis is based is Munnell, A., "Why has Productivity Declined?
     Productivity and Public Investment," New England Economic Review,
     1990, pp. 3-22.  The data were downloaded from the website for
     Baltagi's text.  
 
 - Cornwell and Rupert Returns to Schooling
     Data, 595 Individuals, 7 Years
     Variables in the file are
     EXP = work experience
     WKS = weeks worked
     OCC = occupation, 1 if blue collar
     IND = 1 if manufacturing industry
     SOUTH = 1 if resides in south
     SMSA = 1 if resides in a city (SMSA)
     MS = 1 if married
     FEM = 1 if female
     UNION = 1 if wage set by union contract
     ED = years of education
     BLK = 1 if individual is black
     LWAGE = log of wage
     These data were analyzed in Cornwell, C. and Rupert, P., "Efficient
     Estimation with Panel Data: An Empirical Comparison of Instrumental
     Variable Estimators," Journal of Applied Econometrics, 3, 1988, pp.
     149-155.  See Baltagi, page 122 for further analysis.  The data
     were downloaded from the website for Baltagi's text.  
 
 - German Manufacturing Innovation Data,
     1,270 Firms, 5 years
     Variables in the file are
     YEAR = year, 1994-1998
     FIRM = 1,...,1,270
     IP = Product or innovation occurred, 0/1 variable
     EMPL = employment
     IM = Imports
     IMUM = import share in industry
     FDIUM = FDI share in industry
     PROD = productivity measure
     LOGSALES = log of industry sales
     RAWMTL = dummy for firm in raw materials industry
     INVGOOD = dummy for firm in investment goods industry
     CONSGOOD = dummy for firm in consumer goods industry 
     FOOD = dummy for firm in food industry
     These data were analyzed in Bertschek, I. and M. Lechner, "Convenient
     Estimators for the Panel Probit Model," Journal of Econometrics, 87, 2,
     1998, pp. 329-372.  See also Greene, Econometric Analysis, 5th ed.
     (2003), for various analyses, and Greene, W., "Convenient Estimators for
     the Panel Probit Model: Further Results," Empirical Economics, 2004.
     These data are not publicly available.  Extracts from the data set will
     be provided in class.
 
 - German Health Care Usage Data, 7,293
     Individuals, Varying Numbers of Periods
     Data downloaded from the Journal of Applied Econometrics Archive. This is
     an unbalanced panel with 7,293 individuals. The data can be used for
     regression, count models, binary choice, ordered choice, and bivariate
     binary choice.  This is a large data set.  There are altogether 27,326
     observations.  The number of observations per person ranges from 1 to 7.
     (Frequencies are: 1=1525, 2=1079, 3=825, 4=926, 5=1051, 6=1000, 7=887.)
     Note, the variable NUMOBS below tells how many observations there are
     for each person.  This variable is repeated in each row of the data for
     the person.
     Variables in the file are
     ID = person - identification number
     FEMALE =  female = 1; male = 0
     YEAR = calendar year of the observation
     AGE = age in years
     HSAT =  health satisfaction, coded 0 (low) - 10 (high)  Note,
     this variable has 40 coding errors. Variable NEWHSAT below fixes them.
     HANDDUM = handicapped = 1; otherwise = 0
     HANDPER = degree of handicap in percent (0 - 100)
     HHNINC =  household nominal monthly net income in German marks /
     10000
     HHKIDS = children under age 16 in the household = 1; otherwise = 0
     EDUC =  years of schooling
     MARRIED =  married = 1; otherwise = 0
     HAUPTS =  highest schooling degree is Hauptschul degree = 1;
     otherwise = 0
     REALS =  highest schooling degree is Realschul degree = 1; otherwise
     = 0
     FACHHS = highest schooling degree is Polytechnical degree = 1; otherwise =
     0
     ABITUR = highest schooling degree is Abitur = 1; otherwise = 0
     UNIV =  highest schooling degree is university degree = 1; otherwise
     = 0
     WORKING = employed = 1; otherwise = 0
     BLUEC = blue collar employee = 1; otherwise = 0
     WHITEC = white collar employee = 1; otherwise = 0
     SELF = self employed = 1; otherwise = 0
     BEAMT =  civil servant = 1; otherwise = 0
     DOCVIS =  number of doctor visits in last three months
     HOSPVIS =  number of hospital visits in last calendar year
     PUBLIC =  insured in public health insurance = 1; otherwise = 0
     ADDON =  insured by add-on insurance = 1; otherwise = 0
     NUMOBS =  number of observations for this person. Repeated in each
     row of data.
     NEWHSAT = recoded value of HSAT with coding errors corrected. 
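
A variable such as NUMOBS, and the table of group-size frequencies for an unbalanced panel, can be derived from the ID column alone. The sketch below shows one way to do this; the function names add_numobs and group_size_frequencies and the ID values are made up for illustration.

```python
from collections import Counter

def add_numobs(ids):
    """Return, for each row, the number of rows sharing that row's ID."""
    counts = Counter(ids)
    return [counts[i] for i in ids]

def group_size_frequencies(ids):
    """Return {group size: number of individuals with that many rows}."""
    counts = Counter(ids)
    return dict(sorted(Counter(counts.values()).items()))

# Made-up ID column: person 1 has 3 rows, person 2 has 1, person 3 has 2.
ids = [1, 1, 1, 2, 3, 3]
numobs = add_numobs(ids)
freq = group_size_frequencies(ids)

print(numobs)  # [3, 3, 3, 1, 2, 2]
print(freq)    # {1: 1, 2: 1, 3: 1}

# Consistency checks: individuals and rows implied by the frequency table.
print(sum(freq.values()), sum(size * n for size, n in freq.items()))  # 3 6
```

The same two sums give a quick check of a reported frequency table against the stated numbers of individuals and observations.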
 
 - World Health Organization Panel Data on
     Health Care Attainment:  191 Countries, 5 Years (Some countries
     fewer)
     These data have been used by many researchers to study the Health Care
     Survey assembled by WHO as part of the Year 2000 World Health Report. On
     the course bibliography, see, for example, Greene (2004a).  Note,
     variables marked * were updated with more recent sources in Greene
     (2004a). Missing values for some of the variables in this data set are
     filled by using fitted values from a linear regression.  To set the
     proper sample for panel data analysis, use observations for which SMALL =
     0.  To obtain the balanced panel, then use only observations with
     GROUPTI = 5.
     COMP = composite measure of health care attainment; LCOMP = logCOMP
     DALE = Disability adjusted life expectancy (other measure); LDALE =
     logDALE
     YEAR = 1993,...,1997;  TIME = 1,2,3,4,5;  T93, T94, T95, T96,
     T97 = year dummy variables
     HEXP = per capita health expenditure; LHEXP = logHEXP; LHEXP2 =
     log-squaredHEXP
     HC3 = educational attainment; LHC = logHC3; LHC2 = log-squaredHC3; LHEXPHC
     = logHEXP * logHC3
     SMALL = indicator for states, provinces, etc. SMALL > 0 implies
     internal political unit, = 0 implies country observation
     COUNTRY = number assigned to country
     STRATUM = another country indicator
     GROUPTI = number of observations when SMALL = 0. Usually 5, some = 1, one
     country = 4.
     OECD = dummy variable for OECD country (30 countries)
     GINI = gini coefficient for income inequality
     GEFF = world bank measure of government effectiveness*
     VOICE = world bank measure of democratization of the political process*
     TROPICS = dummy variable for tropical location
     POPDEN = population density*
     PUBTHE = proportion of health expenditure paid by public authorities
     GDPC = normalized per capita GDP; LGDPC = logGDPC; LGDPC2 =
     log-squaredGDPC  
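
The sample selection described in the notes for this data set (SMALL = 0 for the panel of countries, then GROUPTI = 5 for the balanced panel) can be sketched as a pair of filters. The function names and the rows below are hypothetical; in particular, the GROUPTI value shown for the SMALL > 0 row is made up, since the notes define GROUPTI only for SMALL = 0.

```python
def country_sample(rows):
    """Keep country-level observations only (SMALL == 0)."""
    return [r for r in rows if r["SMALL"] == 0]

def balanced_sample(rows):
    """Keep country-level observations from countries observed in all 5 years."""
    return [r for r in country_sample(rows) if r["GROUPTI"] == 5]

# Made-up, truncated rows (not real WHO data).
rows = [
    {"COUNTRY": 1, "YEAR": 1993, "SMALL": 0, "GROUPTI": 5},
    {"COUNTRY": 1, "YEAR": 1994, "SMALL": 0, "GROUPTI": 5},
    {"COUNTRY": 2, "YEAR": 1997, "SMALL": 0, "GROUPTI": 1},
    {"COUNTRY": 3, "YEAR": 1997, "SMALL": 2, "GROUPTI": 5},
]

panel = country_sample(rows)
balanced = balanced_sample(rows)
print(len(panel), len(balanced))  # 3 2
```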
 
   