Evaluating Psychological Research Reports
Dimensions, Reliability, and Correlates of Quality Judgments
Stephen D. Gottfredson
American Psychologist, October 1978
ABSTRACT: A series of studies designed to investigate three major aspects of the peer-evaluation system in psychology is presented. Editors and editorial consultants for nine major psychology journals were surveyed for opinions about the desirability of article characteristics. Dimensional structures for evaluation were explored, resulting in a set of prescriptive norms for assessment. Substantial agreement on the desirability of article characteristics is demonstrated, and psychologists heavily involved in the manuscript decision-making processes associated with different journals apparently employ these dimensions in the same way. These results were used in a second study demonstrating increased reliability of peer judgments of article quality. Finally, it was found that peer judgments of article quality and impact are only very modestly correlated with subsequent citation of the articles.
The scientific enterprise in psychology, as in other disciplines, functions as a social system. Determinations of success, failure, quality, relevance, and prestige are based on a system of peer evaluation. Garvey and Gottfredson (1976) have discussed the importance of peer evaluation and communication to the maintenance of our social system of scientific activity, and Kuhn (1962) offers a theoretical framework within which this process operates. From this perspective, approximations to "truth" are replaced by approximations to the perceived nature of the scientific paradigm; and the author's conviction that his or her view approximates "truth" is of less importance in its evaluation than the peer judgment of its approximation to the paradigm.
Evaluative Processes in Science
The characteristic evaluative phases of the scientific information-flow system have been described elsewhere (Garvey, Lin, & Nelson, 1970; Meadows, 1974). In general, this process is initiated informally among a group of friendly scientific colleagues. Informal evaluation continues until the report is subjected to the editorial processes preceding journal publication, after which it can be scrutinized by the scientific public. In effect, the evaluative process is never-ending — evaluation is associated with decisions to cite work in a review of research in a given area or to include mention of the work in specialized texts and treatises.
High rejection rates ("Summary Report," 1976) and a general interest in quality control have recently focused attention on several phases of this evaluative process. Brackbill and Korton (1970) have reviewed author and reader suggestions for improvement of the system, noting that their survey respondents were skeptical of the objectivity of the review process. However, the majority of prospective authors prefer to have their manuscripts reviewed by two or more persons in addition to the journal editor before publication because of the opportunity for feedback from experts afforded by such evaluation. Despite much skepticism, the virtues of peer review are recognized.
The series of studies reported in this article focuses on three major aspects of our evaluation system: (a) the reliability of peer judgments of article quality, (b) the criteria upon which assessments of article quality are likely to be made, and (c) the relations between peer judgments of article quality and the number of citations made to articles following publication.
RELIABILITY OF REVIEWING PROCESSES
Studies of the reliability of peer-review processes in psychology have been confusing. Division 23 (Consumer Psychology) of the APA conducted a contest to select the best paper submitted to a recent convention. Ten of the past 11 division presidents ranked the papers from 1 to 8 — the resulting coefficient of concordance (Kendall's W) was .11 (Bowen, Perloff, & Jacoby, 1972). Here of course, a natural restriction of range is likely to have occurred — final candidates had been both self-selected and preselected.
McReynolds (1971) reported correlations (adjusted by the Spearman-Brown prophecy formula) among sets of ratings of papers submitted for presentation to the Division 12 (Clinical) meeting at the APA national convention ranging from .21 to .84, the average being .62 (approximately .45 before correction). However, because all of the papers were evaluated simultaneously, these coefficients are probably higher than those obtained in the actual journal reviewing process (Scott, 1974).
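As an arithmetic check (assuming, since the reported figures are consistent with it, that the Spearman-Brown correction stepped up the reliability of two ratings), the prophecy formula gives

$$r_{kk} = \frac{k\,r_{11}}{1 + (k - 1)\,r_{11}}, \qquad \frac{2(.45)}{1 + .45} \approx .62 .$$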
In a study designed to approximate the actual reviewing process more closely, Scott (1974) sent a one-page appraisal sheet to both reviewers of each of 287 manuscripts submitted to the Journal of Personality and Social Psychology. Measures of inter-referee agreement ranged from .07 to .37 for attributes of the manuscripts, and the correlation between the two recommendations to the editor was .26.
In general, the agreement between manuscript raters tends to be low, while agreement about the desirability of specified attributes of manuscripts tends to be somewhat better (Wolff, 1970). However, results have been inconclusive or inconsistent, and little attempt has been made to isolate the actual criteria upon which such assessments are made.
Criteria for evaluation. Philosophers and sociologists of science have long been concerned with the prescriptive norms of scientific activity (Merton, 1957; Storer, 1966). Essentially, these prescriptive norms are the "shoulds" and "shouldn'ts" of science.1 Despite some discussion of normative (Lindsey, 1976) and specific (Chase, 1970; Frantz, 1968; Wolff, 1970, 1973) criteria for assessments of scientific quality, considerations of specific criteria have usually been vague or intuitive. One purpose of the present study was to explore potential criteria in an effort to demonstrate increased reliability of peer judgments of article quality.
CITATION COUNTS AND SCIENTIFIC QUALITY
Publication counting holds a traditional place in the evaluation of scientists, and there appears to be modest empirical justification for this practice (e.g., see Clark, 1957; Crane, 1965; Dennis, 1954; Lewis, 1968; Manis, 1951; Meltzer, 1949; Menard, 1971; Roe, 1951, 1961, 1963, 1965; Zuckerman, 1967). Recently, however, citations of scientific works have been proposed as better indexes of productivity and, by implication, quality.
Citation counting was originally suggested as a method of selecting journals for library subscription (Garfield, 1972; Garfield & Sher, 1963; Gross & Gross, 1927; Raisig, 1960) under the assumption that "citation frequency reflects a journal's value and the use made of it" (Garfield, 1972, p. 476). Much research employing citation measures has focused on the impact of journals on their fields (e.g., Buss & McDermott, 1976; Garfield, 1972).
A second major use of citation counting dates from Clark's (1957) attempt to identify and study eminent American psychologists; Clark noted that the number of journal citations accumulated proved a better predictor of eminence (r = .67) than did other variables (including counts of publications). Several later studies also documented this relationship between citations and scientific "success" (e.g., Cole & Cole, 1967; Hagstrom, 1971). While both productivity and citation measures are highly correlated with other measures of scientific success, citations of publications appear to provide a modestly better index.
After publishing a paper by Garfield (1970) suggesting the use of citations as a measure of "current scientific performance," Nature was flooded with letters from concerned scientists. A similar reaction followed the publication of a paper in Science (Margolis, 1967). Scientists contended that in many cases, citation of articles (or of people) is capricious, and to base practical evaluation on such a measure without proper validation is dangerous (see Gottfredson, Garvey, & Goodnow, 1977, for a review of this debate). Another major purpose of this article is to explore the assumed but inadequately tested relation between citation counts and the judged quality of cited articles.
Purpose and Method
The present studies explore the ways journal articles are judged in an effort to achieve increased reliability of peer judgments of article quality, and they examine the assumed but untested relation between citation counts and the judged quality of the cited articles. My purposes are thus (a) to describe relations among criteria upon which assessments of the quality of psychological research reports might be made, (b) to examine areas of agreement or disagreement among psychologists concerning these criteria, (c) to determine whether use of such criteria for evaluation might increase the reliability of peer judgments of the quality of our psychological literature, and (d) to explore the relation between peer judgments of article quality and citation counts.
To achieve these goals, I needed samples of (a) psychologists competent to describe criteria used in the assessment of psychological research, (b) psychological works to evaluate, and (c) judges competent to evaluate these works. Descriptions of each sample are given in following sections.
Mail surveys, details of which varied by sample, were used throughout the conduct of this research. The general procedure for each involved an initial mailing consisting of a cover letter briefly explaining the nature of the task, the questionnaire, and a self-addressed, stamped return envelope. Three to 4 weeks after the initial mailing, and again 3 to 4 weeks later, a follow-up was mailed to all nonrespondents.
This research focused on nine journals (Psychometrika, the Journal of Personality and Social Psychology, the Journal of Experimental Psychology, the Journal of Applied Psychology, the Journal of Abnormal Psychology, the Journal of Comparative and Physiological Psychology, the Journal of Consulting and Clinical Psychology, Psychological Bulletin, and Psychological Review) chosen to sample broadly from psychological literature rather than to include the entire range.2
Study of Evaluative Criteria
SAMPLE
To obtain lists of persons highly involved in the manuscript decision-making processes associated with the nine selected journals, all editors, associate editors, and consulting editors for the years 1968-1975 were included in the initial sample pool. Three journals were thereby underrepresented because their associate and/or consulting editors were not regularly acknowledged. Names of all issue consultants for these three journals were obtained, a tally was made of the number of times each was acknowledged, and the 15% most frequently mentioned were included in the sample pool. Persons active in more than one journal were randomly assigned membership in one group, resulting in a nonredundant group of 757 psychologists. Of these, 175 were found to be authors of articles to be sampled in a later phase of the study, and 37 either had died or could not be located, resulting in a final sample of 545.
SURVEY
A questionnaire containing 83 statements describing attributes of psychological journal articles was mailed to each sample member during April-July 1976. Of the 545 questionnaires mailed, 338 (62%) were returned, of which 299 were fully completed, resulting in an analyzable response of 55%. No differences were found between respondents and nonrespondents in terms of journal affiliation or editorial role.
Instructions sought to ensure that responses to items (intended to represent "simple" characteristics of articles) would be as independent of one another as possible given the practical constraints imposed by the questionnaire situation. Subjects were asked for their opinions of the relative quality of an article that might be described (at least in part) by each of the statements and were asked to respond to each item on a 7-point scale ("clearly outstanding" to "clearly inferior") relative to published articles in their own areas of expertise (see Table 1 for example items).
TABLE 1
Summary of Principal Components Solution — 83 Questionnaire Items: Proportionate Subsampling (N = 142)

| Component | Primary loading | h² | Questionnaire item |
|---|---|---|---|
| I | .78 | .68 | The problem has not been considered carefully enough. |
| | .71 | .65 | The design used does not justify the conclusions drawn. |
| | .70 | .56 | It misrepresents other viewpoints, literature, data, etc. |
| | .69 | .59 | The author misinterprets the results. |
| | .69 | .60 | The author is apparently not aware of recent developments in the field. |
| | .67 | .55 | The experiment conducted does not address the stated question. |
| | .66 | .51 | The analytical procedures used are misunderstood or misapplied. |
| | .65 | .51 | It shows evidence of poor scholarship. |
| | .65 | .51 | The introduction hides the issue being investigated. |
| | .64 | .54 | The research was poorly executed. |
| | .61 | .53 | The point of the research is never clearly stated. |
| | .59 | .48 | The author is apparently not up-to-date with the literature. |
| | .57 | .59 | It makes unjustified assumptions. |
| | .50 | .43 | The author uses lofty scientific jargon when plain English will do. |
| | .48 | .47 | It contains inflammatory, inappropriate, or unscientific rhetoric. |
| | .46 | .48 | It appears to have been undertaken primarily to win a publication for the author. |
| | .41 | .38 | It shows a general lack of theoretical understanding or insight. |
| II | .79 | .70 | It attempts to unify the field. |
| | .65 | .55 | It deals with an important topic. |
| | .60 | .49 | It has excellent generalizability. |
| | .59 | .62 | It is exciting to read. |
| | .58 | .52 | It is clever and innovative. |
| | .57 | .58 | The results have practical, applied implications. |
| | .55 | .63 | It integrates divergent theoretical perspectives into a single structure. |
| | .53 | .48 | It is comprehensive. |
| | .52a | .57 | It proposes a new theory to explain existing observations. |
| | .51 | .55 | It illuminates a new problem. |
| | .45 | .47 | It summarizes research in the field. |
| III | .65 | .62 | It is well written. |
| | .64 | .56 | It avoids unrealistic speculation. |
| | .64 | .49 | The results are clearly presented. |
| | .64 | .48 | It contains a brief and comprehensive abstract. |
| | .61 | .56 | It is easily understood. |
| | .61 | .65 | The reader is able to follow what the author is doing. |
| | .60 | .47 | Key references are available. |
| | .58 | .49 | It provides full information for interpretation and evaluation. |
| | .58 | .49 | It contains comprehensive tables and figures. |
| | .56 | .50 | The graphics are legible and attractively prepared. |
| | .52 | .39 | The research was competently conducted. |
| | .46 | .46 | The finding is well related to other relevant work. |
| IV | .69 | .58 | It makes the reader think about something in a different way. |
| | .64 | .59 | It offers a new perspective on an old problem. |
| | .61 | .62 | It integrates data or findings from diverse sources into a coherent picture. |
| | .58 | .51 | It provides a useful theoretical framework from which interesting and meaningful questions can be evaluated. |
| | .55 | .48 | It reports some interesting new empirical observations. |
| | .52 | .45 | It provides new ideas for other investigators. |
| | .51 | .55 | It aids in the understanding of complex issues. |
| | .40 | .45 | It contributes a new methodology. |
| V | .81 | .73 | The problem addressed is trivial. |
| | .79 | .70 | The results are trivial or unimportant. |
| | .65 | .62 | It contributes little or nothing to the field. |
| | .64 | .54 | It is aimed at trivial segments of theory or observation. |
| | .60 | .55 | It leaves you with a feeling of "who cares" or "so what." |
| | .56 | .47 | The findings are not memorable. |
| | .49b | .59 | It is a mediocre article on a popular topic. |
| | .43 | .46 | It is dull. |
| VI | .58 | .57 | It speaks to the central problems facing the discipline. |
| | .51c | .50 | It provokes much useful controversy. |
| | .49 | .51 | It outlines implications for future work. |
| | .48 | .39 | The topic is interesting, but it provides no result applicable to real-world problems. |
| | .47 | .57 | It deals with conceptual clarification. |
| VII | .77 | .65 | It is heavy on results, light or "spotty" on discussion of results. |
| | .56 | .48 | It contains more data, but no new insights. |
| | .56 | .50 | It emphasizes description rather than explanation. |
| | .49d | .46 | It does not draw conclusions. |
| VIII | .56 | .42 | The author uses precisely the same procedures as everyone else. |
| | .52 | .50 | The results are not overwhelming in their implications, but they constitute a needed component of knowledge. |
| | .48 | .39 | It is a continuation of previous studies by the same author. |
| IX | .59 | .46 | It presents a small amount of data from a large research project. |
| | .54 | .41 | The findings seem puzzling. |
| | .52 | .42 | It focuses exclusively on one side or aspect of the problem. |
| | .44 | .51 | The topic is of interest to a relatively small number of psychologists. |
| X | .54 | .51 | The author's understanding of historical perspective is demonstrated. |
| | -.51 | .57 | It shows that data do or do not conform to some theoretical expectation. |
| | .41 | .42 | It contains humor or vivid images. |

Notes. Varimax rotation. Loadings < |.40| are ignored due to space limitations. It should be noted, however, that the structure is exceptionally clean — only 5% of the remaining coefficients meet or exceed |.30|. The complete matrix is available from the author.
The following eight items loaded on no component: It challenges or contradicts a well-established fundamental assumption; the findings seem reasonably defensible on methodological grounds; there is some omission of technical terms where they are needed (e.g., 'average' when mean would be better); it lends support rather than breaks new ground; it makes intuitive sense; it is a replication or a near-replication; it is a well-designed investigation of a small problem; the topic is of general interest.
a Item loads .41 on Component IV.
b Item loads .48 on Component I.
c Item loads .40 on Component II.
d Item loads .42 on Component I.
PROPORTIONATE SAMPLING ANALYSES
This initial survey was designed to allow the development of an instrument suitable for the evaluation of articles published in the nine journals. Survey respondents were therefore proportionately sampled (on a random basis) to reflect this article-pool; thus respondents were appropriately weighted according to the relative impact (in terms of articles published) that each one's journal had on the psychological literature under consideration. Analyses reported in this section are based on 142 cases.
Analyses of responses to items on the questionnaire show that item variances are relatively homogeneous, while means vary considerably. Inspection of these distributions suggested mild end effects (Torgerson, 1958). Accordingly, data were transformed following the successive-intervals scaling procedure (Diederich, Messick, & Tucker, 1957) to minimize displacement of item means.
The analyses reported in this section address two issues: (a) the extent to which sample members used the response scale in the same way (i.e., agreed on the placement of individual items), and (b) the reduction of the 83 items to a smaller set of dimensions reflecting the relations among items. Both issues were addressed via factor analysis, but the coefficients upon which the analyses were based differ. A principal-component analysis is appropriate for the latter investigation, and a matrix of product-moment correlations served as input. To address the former, however, a coefficient is needed which retains information about the item means. I have therefore based this analysis on a matrix of cross products standardized with respect to vector length.3
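As a minimal sketch (in Python, with illustrative random data; the variable names are mine), the two input matrices might be computed as follows. The mean-retaining coefficient is my reading of the verbal description given here and in Footnote 3, not a formula taken verbatim from the article.

```python
import numpy as np

# X: a respondents x items matrix of (transformed) questionnaire responses.
# Illustrative random data; in the study the matrix would be 142 x 83.
rng = np.random.default_rng(0)
X = rng.integers(1, 8, size=(142, 83)).astype(float)

# Coefficient retaining information about item means (see Footnote 3):
# cross products of grand-mean-deviated scores, standardized by vector length,
# giving a symmetric matrix with unity on the diagonal and entries in [-1, 1].
D = X - X.mean()                        # deviations from the grand mean
norms = np.linalg.norm(D, axis=0)       # length of each item's deviation vector
C = (D.T @ D) / np.outer(norms, norms)

# Ordinary product-moment correlations (item means removed), the input used
# for the dimensional (principal-components) analysis of the 83 items.
R = np.corrcoef(X, rowvar=False)

# Eigenvalues of either matrix show how much variance each component accounts
# for; a dominant first eigenvalue of C reflects agreement on item placement.
eigvals = np.linalg.eigvalsh(C)[::-1]
print(eigvals[:3], eigvals[0] / eigvals.sum())
```

A dominant first component of C indicates that respondents place the items at similar positions on the response scale; R, by contrast, removes item means and so reflects only covariation among items.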
Results.4 A principal-component analysis of the 83-variable matrix of standardized cross products was performed, and the eigenvalues for the first three principal components were 49.75, 4.62, and 2.21. Nine components had eigenvalues greater than 1.0. The magnitude of the first component (which accounts for 60% of the total variance) suggests that in this subsample there is substantial agreement regarding the "placement" of each of these items on the scale. That this component in fact reflects agreement on the scale value of the items is clear from an inspection of the joint distribution of item means and loadings on the first (unrotated) component. The function is monotonic, although not linear, with very little scatter evident.
Subsequent principal-component analyses were performed on a matrix of product-moment coefficients. The absolute values of the coefficients ranged from .00 to .73, and the means of the absolute value of the correlation of each item with all others ranged from .08 to .22. Although 23 eigenvalues were greater than 1 (range: 12.5-1.0), only the first 10 components (51% of the total variance) were rotated, resulting in a readily interpretable structure.
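The rotation step might be sketched as follows (a standard SVD-based varimax routine, not necessarily the program used in the original analysis; `loadings` is assumed to hold the unrotated loadings for the 10 retained components):

```python
import numpy as np

def varimax(loadings, max_iter=100, tol=1e-6):
    """Orthogonal varimax rotation of an items x components loadings matrix."""
    p, k = loadings.shape
    R = np.eye(k)
    d = 0.0
    for _ in range(max_iter):
        L = loadings @ R
        # Gradient of the varimax criterion (Kaiser normalization omitted for brevity).
        G = loadings.T @ (L ** 3 - L @ np.diag((L ** 2).sum(axis=0)) / p)
        U, s, Vt = np.linalg.svd(G)
        R = U @ Vt
        d_new = s.sum()
        if d_new < d * (1 + tol):
            break
        d = d_new
    return loadings @ R

# Usage (illustrative): `unrotated` is items x components, e.g., eigenvectors of
# the correlation matrix scaled by the square roots of their eigenvalues.
# rotated = varimax(unrotated[:, :10])
```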
Principal-component solution. Of the 10 components summarized in Table 1, only the first 9 (49.6% of the variance) are interpreted. The components are quite well-defined, with few items loading on more than one component (for details, see Gottfredson, 1977).
The first component might properly be labeled a list of "don'ts" — practices to avoid if we want our peers to be favorably impressed with our work. The second and third components seem to suggest a differentiation of two types of "do's" — those dealing primarily with scientific or substantive matters (Component II) and those dealing with stylistic, compositional, or expository matters (Component III). Component IV suggests the importance of originality and heurism, and the fifth component might be labeled "trivial."
While the number of items defining each of the remaining components is small, as is the proportion of the total variance for which each accounts, they nonetheless remain readily interpretable. Component VI seems primarily to reflect scientific advancement, while Component VII seems to merit the label "data grinders" or "brute empiricism," with emphasis on description rather than explanation. The eighth component might be labeled "routine" or "ho-hum" research and brings to mind the remark one respondent penciled in the questionnaire margin: "How about an item like, 'Oh God, there goes old so-and-so again!'?" The final interpreted component primarily reflects narrowness of research concerns.
FULL-SAMPLE ANALYSES
Given the vast differences in subject matter and methodological approaches represented by these nine journals, one might suspect that journal-group differences with respect to the dimensional structure would be evident. This hypothesis was tested on the heterogeneous sample of 299 respondents.
Each respondent in the full sample was scored on each of nine scales constructed from items loading on the nine interpretable components described above. Scale scores were entered as discriminating variables in a stepwise multiple-discriminant-function analysis (with journal affiliation as the variable to be distinguished) to determine whether members of the respective groups differ in their treatment of these scales. Analyses were performed without regard to prior knowledge of group size.
Results. The hypothesis that group members differ with respect to treatment of the nine scales was not confirmed. The original value of Wilks's lambda, which assesses potential discriminability based on the scale scores, is .838 [χ²(24) = 51.6, p < .01]; while statistically significant, this indicates that the (presumably reliable) effect is extremely weak. After removing the effect of the first discriminant function, lambda increased to .926 [χ²(14) = 22.5, p < .07], which suggests that the remaining discriminability is not reliable. While discriminations based on these functions are statistically significant (marginally so for the second discriminant), the discrimination itself is of little practical significance. By using information contained in the discriminant functions, it is only possible to correctly predict group membership for 18% of the cases.
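For readers unfamiliar with these statistics, the sketch below shows how Wilks's lambda and Bartlett's chi-square approximation are typically computed from scale scores and group labels; it omits the stepwise variable entry used in the analysis reported above, so it will not reproduce those figures exactly.

```python
import numpy as np

def wilks_lambda(scores, groups):
    """Wilks's lambda and Bartlett's chi-square for a one-way discriminant
    analysis: `scores` is cases x variables, `groups` is one label per case."""
    X = np.asarray(scores, dtype=float)
    g = np.asarray(groups)
    grand = X.mean(axis=0)
    p = X.shape[1]
    W = np.zeros((p, p))                 # within-groups SSCP matrix
    B = np.zeros_like(W)                 # between-groups SSCP matrix
    for label in np.unique(g):
        Xg = X[g == label]
        d = Xg - Xg.mean(axis=0)
        W += d.T @ d
        m = Xg.mean(axis=0) - grand
        B += len(Xg) * np.outer(m, m)
    lam = np.linalg.det(W) / np.linalg.det(W + B)
    n = X.shape[0]
    k = len(np.unique(g))
    # Bartlett's chi-square approximation, with p * (k - 1) degrees of freedom.
    chi2 = -(n - 1 - (p + k) / 2) * np.log(lam)
    df = p * (k - 1)
    return lam, chi2, df
```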
SUMMARY
These analyses have suggested that (a) psychologists heavily involved in manuscript review for nine major psychological journals agree remarkably on the desirability of specific characteristics of journal articles, (b) a well-defined dimensional structure can be obtained which accounts for half of the remaining "individual differences" variance, and (c) these dimensions are employed in similar ways by persons across subdisciplines in psychology.5
These results suggest that prescriptive norms for scientific evaluation exist and transcend sub-disciplinary bounds. There are things we should (as researchers and authors) do, and there are things we shouldn't do; and many of these behaviors are allegedly prominent in the peer evaluation process. Our next task, then, is to determine if in fact these criteria can be used to achieve increased reliability of peer evaluations of psychological work.
Evaluating Psychological Reports
Two samples were needed for this phase of the investigation: a sample of psychological works to evaluate and a sample of judges competent to evaluate these specific contributions. Since a later portion of the study focuses on the relation between citation counts and quality judgments, the target year selected for study was 1968. Science Citation Index coverage is adequate for that and succeeding years but not for previous years, and the elapsed time span is sufficient for citation analysis (Garfield, 1972; Price, 1965). Thus, articles published during the calendar year 1968 in the nine journals listed earlier provided the sample of psychological works to evaluate.
A sample of judges competent to evaluate these specific works was difficult to obtain for a number of reasons. My solution was simply to survey the authors of these articles and to ask them for the names of three persons whom they considered competent to evaluate the significance of their article in the current framework of psychological knowledge. Persons nominated comprised the "expert" sample pool. This procedure was chosen primarily because of its practicality, simplicity, and proven productivity (Gottfredson et al., 1977). Although this approach may appear to introduce potential bias in the evaluations, (a) the nature of the biasing effect (if any) may actually be conservative if it results in a restriction of range in the ratings, (b) several journals (e.g., Science, Personality and Social Psychology Bulletin) use peer reviewers nominated by manuscript authors (although reviewers may not be limited to those nominated by authors), and (c) subsequent analyses suggest that the procedure had minimal impact on the evaluations made (see Table 4).
SAMPLES
Target articles and authors. A total of 1,289 substantive articles appeared in the nine target journals during 1968. There were, however, only 1,096 single- or first-listed authors (i.e., many authors published two or more articles in these journals during 1968). For each of these authors, one article was selected at random, resulting in a final article pool of 1,096. Thirteen of these authors were dead, and no address could be found for 101 others, resulting in a survey base of 982 authors and articles.
Survey procedures resulted in the return of 692 questionnaire forms, of which 687 (70% of the survey base) were usable. Survey respondents were distributed among the nine journals in about the same proportions as in the target pool [χ²(8) = .84, ns]. Of the 687 responding authors, 148 (21.5%) named no experts,6 16 (2.3%) named one, 43 (6.3%) named two, and 480 (69.9%) named three or more experts. Although 1,550 nominations were made, only 943 individual experts were named.
Experts. In order to maximize the number of articles to be judged, the following assignment procedure was used: Any article for which only one expert had been nominated was assigned that expert, provided it was not in competition with another article for which that single reviewer had also been suggested. Where such competition occurred, the expert was randomly assigned to one of these articles, and the other(s) was dropped from further consideration. This procedure was then repeated with articles for which multiple experts had been nominated. Once assigned, an expert was excluded from further consideration.
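A sketch of one plausible reading of this assignment procedure (Python; the data structure, function name, and tie-breaking details are mine):

```python
import random

def assign_experts(nominations, seed=0):
    """Greedy assignment in the spirit of the procedure described above:
    each expert judges at most one article; articles with a single nominee
    are handled first, then articles with multiple nominees.
    `nominations` maps article -> list of nominated experts (illustrative)."""
    rng = random.Random(seed)
    assigned = {}          # article -> list of assigned experts
    used = set()           # experts already assigned to some article

    # Pass 1: articles with exactly one nominee. If one expert is the sole
    # nominee for several articles, give him or her one of them at random;
    # the remaining such articles are effectively dropped.
    singles = {a: e[0] for a, e in nominations.items() if len(e) == 1}
    by_expert = {}
    for article, expert in singles.items():
        by_expert.setdefault(expert, []).append(article)
    for expert, articles in by_expert.items():
        chosen = rng.choice(articles)
        assigned.setdefault(chosen, []).append(expert)
        used.add(expert)

    # Pass 2: articles with multiple nominees receive every nominee not yet used.
    for article, experts in nominations.items():
        if len(experts) > 1:
            for expert in experts:
                if expert not in used:
                    assigned.setdefault(article, []).append(expert)
                    used.add(expert)
    return assigned
```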
Only 12 articles could not be assigned at least one expert. After excluding experts known to be dead, those for whom no address could be located, and those found to have been (or who later identified themselves as) coauthors of the article with which they had been identified, a final sample of 870 experts remained to judge 527 articles (54% of the original article pool, and 77% of the pool of articles for which the author had responded).
This survey was conducted during November 1976-March 1977. Sample members were assumed to have ready access to the published article they were asked to evaluate, although copies were mailed upon request.
Of the 870 questionnaires mailed, 540 were returned (a 62% response rate). One return was received for 258 articles, two returns for 105 articles, and three returns for 24, resulting in at least one response for 387 (73%) of the available articles and at least two responses for 129 (25%). Respondents were distributed among the nine journals in the same proportions as articles published by these nine journals [χ²(8) = 7.57, ns].
CONSTRUCTION OF EVALUATION SCALES
Given the agreement shown with respect to scale placements of the 83 items discussed above, it would have been possible to build a single "evaluation scale" by simply selecting items falling along a wide range of the original dimension. It is also clear, however, that a multidimensional approach adds information, and a multidimensional approach was therefore followed throughout.
Three criteria were employed to select 36 scale items from the initial 83. First, an item was to load heavily on one component and essentially zero on all others. Second, its variance was to be as small as possible. These two constraints served to ensure (a) that items were empirically good exemplars of their respective principal components, (b) that the resulting nine scales would be as orthogonal as possible, and (c) that there was good agreement with respect to the "value" of the items. The third criterion, given that the first two had been met, was that the item be a subjectively good exemplar of the component with which it had been identified.
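The first two, empirical criteria might be screened mechanically along the following lines (an illustrative sketch; the loading and cross-loading thresholds are my own, not values taken from the article, and the third, subjective criterion is necessarily left to the investigator):

```python
import numpy as np

def candidate_items(loadings, variances, primary_min=0.40, cross_max=0.30):
    """Flag items that are empirically good exemplars of one component:
    a large loading on exactly one component, small loadings elsewhere,
    and (for ranking) a small response variance. Thresholds are illustrative."""
    loadings = np.abs(np.asarray(loadings, dtype=float))   # items x components
    variances = np.asarray(variances, dtype=float)
    primary = loadings.argmax(axis=1)                       # component of largest loading
    primary_load = loadings.max(axis=1)
    cross_load = np.sort(loadings, axis=1)[:, -2]           # second-largest loading
    keep = (primary_load >= primary_min) & (cross_load <= cross_max)
    order = np.lexsort((variances, -primary_load))          # high loading, low variance first
    return [(i, int(primary[i]), float(primary_load[i]), float(variances[i]))
            for i in order if keep[i]]
```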
JUDGMENTS OF TARGET ARTICLES
In addition to rating the article on the 36 items, experts were asked to make three global assessments of the quality and two of the impact of articles in the sample. They were first requested to compare the article to others published at about the same time and dealing with similar topics or problems as well as to others on the same topic regardless of publication date.
Since issues of quality in science are relative and timebound (Kuhn, 1962; Polanyi, 1963), these items were intended to clarify the domains of comparison. Experts indicated judgments on a 7-point scale bounded by the categories "clearly inferior to most articles treating similar topics/problems" and "clearly superior to most articles treating similar topics/problems." A third item requested an overall judgment specifically tied to scientific quality regardless of either subject matter or publication date. Again, a 7-point response scale was used, bounded by the categories "exceptionally low quality; few, if any, articles worse" and "exceptionally high quality; few, if any, articles better."
Results of an earlier study (Gottfredson et al., 1977) suggested a clear distinction between the quality of scientific works and the impact that those works have on their fields. Experts were asked to give their general impression of the impact the article had had upon (a) its specific subject-matter area, and (b) psychological knowledge in general. Experts indicated both judgments on a 7-point scale bounded by the categories "no impact" and "great impact."
Overall quality and impact scales. Table 2 gives the matrix of intercorrelations for these five items. The three quality rating items are highly correlated, as are the two impact ratings, while the correlations across quality and impact items are moderate. Accordingly, the three quality items were summed, as were the two impact items, resulting in a "quality scale" and an "impact scale." The correlation between the quality and impact scales is .58 (N = 378).
TABLE 2
Correlations Among Experts' Quality and Impact Judgments

| Judgment | 2 | 3 | 4 | 5 |
|---|---|---|---|---|
| 1. Evaluation relative to other works, same time/topic | .84 (383) | .74 (382) | .53 (380) | .48 (378) |
| 2. Evaluation relative to other works, any time/same topic | | .78 (383) | .48 (381) | .49 (379) |
| 3. Overall quality | | | .52 (382) | .52 (380) |
| 4. Impact on subject matter | | | | .74 (380) |
| 5. Impact on psychological knowledge | | | | |

Note. Cell Ns are in parentheses.
RELIABILITY OF EXPERTS' JUDGMENTS
Two important types of reliability must be considered — interjudge reliability (agreement across judges with respect to assessments) and intrajudge reliability. The latter can be thought of both in terms of a measure of a given judge's consistency with respect to his or her judgments and as a measure of the reliability (internal consistency) of the measuring instrument itself (in this case, the set of scales).
Table 3 presents interrater and homogeneity coefficients for the overall quality and impact scales described above. The internal consistency coefficients are based on all articles for which there was at least one judgment, and interjudge coefficients are based on all articles for which at least two experts were available (where more than two experts were available, extras were randomly excluded). Both scales have high internal consistencies, especially considering the small number of items composing each. Interjudge agreement, however, is relatively modest.
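The Table 3 notes identify the internal consistency coefficient as Cronbach's alpha and the interjudge coefficient as an intraclass coefficient. A minimal sketch of both follows, assuming a one-way random-effects intraclass coefficient with two judges per article (the specific ICC form used is not stated in the article):

```python
import numpy as np

def cronbach_alpha(item_scores):
    """Cronbach's alpha for an articles x items matrix of ratings."""
    X = np.asarray(item_scores, dtype=float)
    k = X.shape[1]
    item_var = X.var(axis=0, ddof=1).sum()      # sum of item variances
    total_var = X.sum(axis=1).var(ddof=1)       # variance of total scores
    return (k / (k - 1)) * (1 - item_var / total_var)

def icc_oneway(ratings):
    """One-way random-effects intraclass correlation, ICC(1), for an
    articles x judges matrix of scale scores (here, two judges per article)."""
    Y = np.asarray(ratings, dtype=float)
    n, k = Y.shape
    grand = Y.mean()
    ms_between = k * ((Y.mean(axis=1) - grand) ** 2).sum() / (n - 1)
    ms_within = ((Y - Y.mean(axis=1, keepdims=True)) ** 2).sum() / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
```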
TABLE 3
Internal Consistency and Interjudge Reliability Coefficients for Evaluation Scales

| Scale | Number of items | Correlation with quality scale | Correlation with impact scale | Internal consistencya | Interjudge agreementb |
|---|---|---|---|---|---|
| Overall quality and impact scales | | | | | |
| Quality | 3 | | .58 (378) | .92 (382) | .41c (121) |
| Impact | 2 | .58 (378) | | .85 (380) | .35c (122) |
| Scales developed from study of evaluative criteria | | | | | |
| 1. Don'ts | 5 | -.59 (329) | -.36d (327) | .78 (331) | .16 (92) |
| 2. Substantive do's | 4 | .67 (333) | .58d (332) | .58 (335) | .50c (95) |
| 3. Stylistic/compositional do's | 5 | .45 (332) | .36d (331) | .74 (335) | .20e (99) |
| 4. Originality/heurism | 4 | .66 (338) | .51d (336) | .86 (340) | .37c (97) |
| 5. Trivia | 4 | -.67 (337) | -.58d (335) | .89 (339) | .40c (97) |
| 6. Where do we go from here? | 4 | .56 (319) | .49d (318) | .64 (321) | .45c (83) |
| 7. Data grinders | 3 | -.53 (337) | -.36d (335) | .70 (339) | .49c (95) |
| 8. Ho-hum research | 3 | -.15 (318) | -.05d (318) | .10 (320) | .22e (83) |
| 9. Magnitude of problem/interest | 4 | -.33 (317) | -.38 (316) | .13 (319) | .19e (91) |
| Combined evaluation scales | | | | | |
| Scales 1-9 combined | —f | .75 (338) | .60 (336) | .86 (340) | .46e (96) |
| Scales 1-7 combined | —f | .72 (338) | .60 (336) | .89 (340) | .49e (96) |

Note. Cell Ns are in parentheses.
a Cronbach's alpha. b Intraclass coefficient. c Coefficient significantly greater than 0 (p < .01). d Scale correlates significantly less well with impact than with quality (p < .05). e Coefficient significantly greater than 0 (p < .05). f Number of items ranges from 25 to 36 (Scales 1-9 combined) and 20 to 29 (Scales 1-7 combined).
Reliability of evaluative scales. Each expert was asked to indicate whether each of the 36 items derived from the study of evaluative criteria was characteristic or descriptive of the article he or she had read. Responses were made on a 6-point scale bounded by the categories "strongly disagree" and "strongly agree" (that the statement is characteristic or descriptive of the article).
Scores for each respondent for each of the nine scales were examined relative to the quality and impact scales, and in terms of both reliability measures. Table 3 also summarizes these results. All nine scales are correlated in the expected direction with the quality and impact measures, although sizable differences in the magnitude of these coefficients are evident. Scale 8 ("Ho-hum research") is essentially uncorrelated with either quality or impact — a reflection of the unreliability of the scale. In all cases but one, reliable scales evidence significantly lower correlations with the impact than with the quality measure. While this could reflect the degree of independence of these two measures (the quality and impact measures correlate .58), it could also be due to the fact that the development of these nine scales was concretely tied to "quality."
Internal consistency coefficients for all but the last two scales are quite acceptable. It should be noted that these final two scales (a) accounted for very little variance in the dimensional structure obtained from the study of evaluative criteria and (b) fell toward the middle of the response continuum for that survey — indicating their relative irrelevancy.
These two scales, as well as Scale 1 ("Don'ts") and Scale 3 ("Stylistic/compositional do's"), show low agreement across judges. Again, inspection shows that this is due to lack of variance in the subsample. The remaining five scales all demonstrate acceptable reliability relative to results reported previously (e.g., Scott, 1974).
Reliability of combined scales. Missing data (i.e., nonresponses to single items) were a problem in the present research. The internal and interjudge reliabilities just discussed are based on all cases for which (for a given scale) responses were complete. When combining scales, however, far too many cases are lost. To counteract this problem, the mean judgment over items on a given scale was computed for each respondent who had completed the majority of items on that scale; it was then used in the remaining analyses discussed in this article.7
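A short sketch of this scoring rule (Python/pandas; the function and argument names are mine), assuming "majority" means more than half of a scale's items answered:

```python
import pandas as pd

def scale_score(responses, scale_items, min_fraction=0.5):
    """Mean over a scale's items for each respondent who answered more than
    half of them; other respondents receive a missing score. `responses` is a
    DataFrame of item ratings, `scale_items` the columns belonging to one scale."""
    sub = responses[scale_items]
    answered = sub.notna().sum(axis=1)
    means = sub.mean(axis=1, skipna=True)
    return means.where(answered > len(scale_items) * min_fraction)
```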
To assess agreement across judges over all scales, scales that were negatively correlated with the quality and impact judgments were reflected and all scale scores summed. As noted earlier, little agreement is evidenced on four scales (due in two cases to a lack of variance in the subsample, and in two others to the unreliability of the scales themselves). As expected, agreement across experts is better for the subset of seven reliable scales than for the full set of nine, as is the internal consistency measure (see Table 3). For neither set of scales does reliability increase substantially over the reliabilities of specific individual scales.
SUMMARY AND DISCUSSION
In general, these analyses have documented greater reliability of peer judgments of article quality than has been presented in past reports (i.e., Bowen et al., 1972; McReynolds, 1971; Scott, 1974). Although agreement across judges is only moderate, the internal consistency evident suggests that the relative lack of agreement is not due simply to unreliability in the scales themselves.
As noted earlier, the use of experts nominated by the article authors might have two effects. First, we might expect agreement from this set of judges simply because all experts were so named by the authors of the articles they judged. The data suggest, however, that this is not the case. Experts were asked a series of questions designed to assess the extent of their familiarity with (a) the field represented by the article they were to judge and (b) the author(s) of the articles. Table 4 gives the correlations between responses to several of these items and the experts' judgments of the target articles. Although all of these coefficients are statistically significant, they are of little practical importance. For example, whether experts are personally acquainted with the author accounts for only 3% of the variance in judgments of the quality or impact of the articles.
TABLE 4
Correlations Between Experts' Judgments and Acquaintance with Authors

| Acquaintance measure | Experts' judgments of quality | Experts' judgments of impact | Combined evaluative scales |
|---|---|---|---|
| Had you read or scanned this article before? | .24 (336) | .21 (334) | .23 (338) |
| Are you acquainted with any other works by the author of this article? | .16 (337) | .16 (333) | .12 (339) |
| Are you personally acquainted with the author of this article? | .16 (338) | .17 (336) | .16 (340) |

Note. Cell Ns are in parentheses.
It may be the case, however, that articles in this sample are in fact "good" articles. Further study may need also to include articles from less prestigious journals (Xhignesse & Osgood, 1967) — for the more heterogeneous the sample of articles, the better the reliability of the judgments should be.
On this point, it is interesting to note that this is a relatively homogeneous set of articles, and we might thus expect relatively modest reliability. The problem faced by editors receiving manuscripts for publication, however, is somewhat different. The pool of incoming manuscripts is likely to be much more heterogeneous with respect to quality than a pool of published manuscripts. Hence, we would expect reliability to be better. In other words, the present research has achieved better agreement on a less heterogeneous sample.
Citation Counts and Peer Judgments
Much effort has been expended in a search for measures of quality in science, and attention has recently focused on citation counts. Citation analysis has primarily been used in an evaluative fashion to attempt to identify significant contributors to science, be they individuals (Bayer & Folger, 1966; Chubin, 1973; Clark, 1957; Cole & Cole, 1967; Dennis, 1954; Garfield, 1970), laboratories (Westbrook, 1960), or journals (Buss & McDermott, 1976; Garfield, 1972). The problem of identifying a significant contribution of a scientist to science, however, has received less attention — despite the obvious assumption that citations of papers reflect a measure of their quality. This section makes use of the two studies previously reported to examine this issue.
METHOD
The Science Citation Index was searched for all citations made of the 687 articles in an 8-year period following the date of their publication (1968). Specific notation was made of self-citations (defined as citations of the referent article by its single or first-listed author in subsequent publications) and of citations in review articles. For a 10% subsample (n = 66), a second independent count was made of Science Citation Index entries for those years in which citation was heaviest (and hence the tabulation most difficult) to assess the accuracy of the counts. Intercoder reliability coefficients (r) for the various citation measures were all well above .9.
DATA CONSIDERATIONS
Distributions of citations, whether of articles, journals, or people, are highly skewed (see Table 5). Results of correlational analyses based on distributions as highly skewed as these can be misleading, since the least-squares model gives disproportionate weight to deviant scores. Hence, the citation data were transformed [X' = loge(X + 1)] to ameliorate the disproportionate weighting of extreme scores.8
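A brief illustration of the transformation (with made-up counts; the article's data are not reproduced here):

```python
import numpy as np
from scipy.stats import skew

# Illustrative raw citation counts for a handful of hypothetical articles.
citations = np.array([0, 1, 2, 2, 5, 6, 9, 14, 37, 160], dtype=float)

# The transformation used in the article, X' = loge(X + 1), pulls in the long
# right tail so that least-squares correlations are not dominated by outliers.
log_citations = np.log(citations + 1)
print(skew(citations), skew(log_citations))
```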
TABLE 5
Ranges and Measures of Central Tendency for Citation Measures

| Citation measure | Minimum value | Maximum value | Mean | Median | Mode |
|---|---|---|---|---|---|
| Full sample of target articles (N = 687) | | | | | |
| Total citations | 0 | 160 | 10.7 | 5.9 | 2 |
| Total self-citations | 0 | 12 | 1.0 | .5 | 0 |
| Total citations by others | 0 | 158 | 9.7 | 5.1 | 2 |
| Total citations in reviews | 0 | 7 | .8 | .5 | 0 |
| Target articles judged by at least one expert (N = 387) | | | | | |
| Total citations | 0 | 160 | 12.4 | 6.6 | 2 |
| Total self-citations | 0 | 7 | 1.1 | .6 | 0 |
| Total citations by others | 0 | 158 | 11.3 | 5.6 | 2 |
| Total citations in reviews | 0 | 7 | .8 | .5 | 0 |
RESULTS
Table 6 contains the product-moment correlation coefficients obtained between experts' judgments of both the quality and impact of the target articles for which at least one response was received and the citations made of those articles during the 8-year period following their publication. While largely statistically significant, these relations are very weak. The highest observed (between experts' judgments of impact and the log of the total citations made of the articles) was .37 (p < .001). Only 14% of the variance in experts' judgments of impact can be accounted for, given knowledge of the number of citations made of the same articles in the 8 years following their publication. The relation relative to experts' judgments of article quality is even weaker (r = .24; p < .001). None of the individual evaluative scales approaches even this degree of relation with the citation measure although the combined set of evaluative scales correlates with the citation measures to essentially the same extent as does the overall judgment of article quality. Finally, it is apparent that controlling for self-citation is not necessary (cf. Cole & Cole, 1971).
TABLE 6
Correlations Between Experts' Judgments and Citation Measures

| Experts' judgments | Total citations (loge + 1) | Total citations by others (loge + 1) | Total review citations (loge + 1) |
|---|---|---|---|
| Quality scale (382) | .24 | .22 | .11 |
| Impact scale (380) | .37 | .36 | .16 |
| Scale 1-Don'ts (331) | -.05 | -.04 | .03 |
| Scale 2-Substantive do's (335) | .23 | .22 | .15 |
| Scale 3-Stylistic/compositional do's (335) | .08 | .07 | .06 |
| Scale 4-Originality/heurism (340) | .15 | .13 | .07 |
| Scale 5-Trivia (339) | -.18 | -.16 | -.07 |
| Scale 6-Where do we go? (321) | .17 | .17 | .06 |
| Scale 7-Data grinders (339) | -.10 | -.09 | -.01 |
| Scale 8-Ho-hum research (320) | -.12 | -.11 | -.06 |
| Scale 9-Magnitude of problem/interest (319) | -.05 | -.08 | -.13 |
| Scales 1-9 combined (340) | .19 | .18 | .10 |

Note. Cell Ns are in parentheses.
Judgments averaged across experts are more reliable than are the individual judgments (ric = .58 for the quality scale; ric = .52 for the impact scale). As demonstrated in Table 7, no change is evident when these combined scores are correlated with the citation measure.
TABLE 7
Correlations Between Citation Measures and Averaged Judgments of Target Articles

| Experts' judgments | Total citations | Total citations by others | Total review citations |
|---|---|---|---|
| Averaged quality judgments (127) | .21 | .20 | .13 |
| Averaged impact judgments (127) | .27 | .27 | .14 |

Note. Cell Ns are in parentheses.
Issues of heteroscedasticity. Hagstrom (1971) and others (Gottfredson et al., 1977) have suggested that the relations between citation measures and other indexes of scientific quality may be strongly heteroscedastic, due primarily to the extreme skew of citation distributions; this is in fact the case for the present data (despite the log transformation). The joint distributions of the various peer-judgment and citation measures suggest that the relations are markedly better for higher values of the citation measure than for lower values.
Accordingly, the sample of 387 articles was split into two subsamples such that one group of articles fell below and one above the median on the citation measure. Since the median number of citations for this set of articles is 6.5, values of the citation measure could range from 0 to 6 for the low group, and from 7 to 160 for the high group. Correlations between citation measures and the quality and impact judgments for these two groups are given in Table 8. It is evident that the modest correlation found earlier between citations and the quality and impact judgments for this set of articles is due almost exclusively to association at the higher values of the citation measure.
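A sketch of the split-sample computation (Python; illustrative arrays and my own function name, with the 6.5 median taken from the text):

```python
import numpy as np

def split_half_correlations(judgments, citations, median=6.5):
    """Pearson correlations between expert judgments and loge(citations + 1),
    computed separately for articles below and above the citation median."""
    judgments = np.asarray(judgments, dtype=float)
    citations = np.asarray(citations, dtype=float)
    log_cites = np.log(citations + 1)
    low = citations <= median            # 0-6 citations for a median of 6.5
    high = ~low                          # 7 or more citations
    r_low = np.corrcoef(judgments[low], log_cites[low])[0, 1]
    r_high = np.corrcoef(judgments[high], log_cites[high])[0, 1]
    return r_low, r_high
```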
TABLE 8
Correlations Between Expert Judgments and Total Citations for High- and Low-Citation Groups

| Experts' judgments | Low-citation group (at or below median) | High-citation group (at or above median) |
|---|---|---|
| Article quality (382) | -.03 (191) | .33 (191) |
| Article impact (380) | .03 (188) | .36 (192) |

Note. Correlations are with total citations, transformed as loge(X + 1). Cell Ns are in parentheses.
Summary and Conclusions
Issues of scientific quality have long been a major concern of our enterprise. This article has presented a series of exploratory studies designed to investigate three major aspects of our evaluation system: (a) criteria upon which assessments of the quality of articles may be made, (b) the reliability of peer judgments of article quality, and (c) the relations between peer judgments of article quality and the number of citations made to articles following publication.
As discussed earlier, one way to view the results of the study of evaluative criteria is in terms of a set of prescriptive norms for assessment. Prescriptive norms (Merton, 1957; Storer, 1966) can be thought of as idealized behavior plans — they are outlines (more or less complete) that prescribe our actions. Descriptive norms, on the other hand, are summaries of actual behavior. Correspondence between behavior plans and actual behavior, of course, may be less than perfect.
Despite the considerable agreement demonstrated here on prescriptive norms for the assessment of manuscripts and articles, and despite the demonstrated importance of peer-evaluation processes to the maintenance of our social enterprise (Brackbill & Korton, 1970; Garvey & Gottfredson, 1976), studies of peer-evaluation processes in psychology have generally offered a dismal picture. Despite some limitations, the present study has demonstrated greater reliability for peer judgments of quality in science than has been previously found (Bowen et al., 1972; McReynolds, 1971; Scott, 1974). While this may be due in part to the use of empirically derived criteria for evaluation, it may also be due to differences in goals, for editors may in fact attempt to select reviewers who are known to have opposing viewpoints with respect to issues treated in a given manuscript (Scott, 1974). While this would be expected to reduce reliability (in the sense of agreement across reviewers), it could also be expected to increase the adequacy of our reviewing process.
Stable and relatively high correlations have consistently been found between the number of citations of a scientist's work and various measures of the "success" of the scientist. Results of this study have indicated that (a) citations of specific articles are only very modestly correlated with peer judgments of the quality and impact of those articles, and (b) this relation is due almost exclusively to association in the higher (above the median) ranges of the citation measure.
It has been proposed that science policymakers and scientific funding agencies might make use of an index of scientific quality based on citations in both the evaluation of funded research and in the determination of which (or whose) research to fund (Wade, 1975). It has been proposed that the use of such an index may also be of aid in decision making on matters such as tenure (Geller, DeCani, & Davies, Note 1). Clearly, if practical decisions affecting not only individual scientists but the very nature and directions of future scientific research are to be made on the basis of these measures, it is imperative that we understand them better than we do at present. As Norman Hackerman, Chairman of the National Science Board, has noted in congressional testimony, "At this point, there is some risk in reading too much, too soon, into science indicators and using them for policy purposes where they are not yet appropriate" (Hearings Before the Subcommittee, 1976).
FOOTNOTES
1. In delineating evaluation criteria, we need to distinguish clearly between prescriptive and descriptive norms. Prescriptive norms can be thought of as idealized behavior plans — they are outlines (more or less complete) that prescribe our actions. Descriptive norms, on the other hand, are summaries of actual behavior. Correspondence between the two, of course, may be less than perfect.
2. In 1962 (Reports of the American Psychological Association's Project on Scientific Information Exchange in Psychology, 1962), it was found that 27 journals were necessary to cover the "core" of the psychological literature, and Psychological Abstracts regularly searches over 600 journals. Recent figures from the Institute for Scientific Information report 98 psychological journals covered by the Science Citation Index. All of the above journals, however (with the exception of Psychometrika), are published by the APA and are considered highly prestigious.
3. The configural representation of a data matrix can of course vary considerably with the information contained in the input matrix coefficient (cf. Torgerson, 1968). In general, the most commonly used matrices for factor analysis are the product-moment correlation matrix and the variance-covariance matrix. The correlation coefficient standardizes variables with respect to both the mean and the variance, while the covariance allows item variances to differ, standardizing only with respect to means. Since item means are the issue of importance here, the coefficient used is a cross product standardized with respect to vector length. Further, to minimize artificial inflation of the coefficient due to arbitrary scale value assignment, the grand mean is also extracted. The coefficient described is of the form c_jk = Σ_i (x_ij − x̄)(x_ik − x̄) / √[Σ_i (x_ij − x̄)² Σ_i (x_ik − x̄)²], where x_ij is respondent i's response to item j and x̄ is the grand mean over all items and respondents,
which results in a symmetric matrix of cross products ranging from -1.00 to +1.00, with unity on the major diagonal. Special acknowledgment is due Warren S. Torgerson for its development. For a more complete description, see Gottfredson (1977).
4. It should be noted that the n/variable ratio for the proportionate analyses is somewhat low. However, analyses of the full data set (N = 299) confirm results of the proportionate analysis (see Footnote 5).
5. The group membership classification is not truly exclusive. Cases arose in which an individual reviewed for more than one journal, and group membership was randomly assigned. Lack of independence could obscure group differentiation. However, item variances for the full sample are relatively small and homogeneous, indicating a lack of substantial individual differences with respect to the items. Lack of independence is thus rendered less important than would otherwise be the case. Additionally, a Varimax rotation of the 10-component solution from the full data set (N = 299) almost completely confirms that of the proportionate analyses. Again, this suggests (given the "presumed" heterogeneity of the full set) a "no differences" hypothesis. Finally, it should be noted that the nature of these groups, including the lack of independence, is a fact of the real world. Although one could force independence in the present instance (by eliminating from the sample those identified with more than one journal), one would have no assurance that true independence had been achieved. In fact, if we could assure true independence, the resultant constitution of the groups would be so far removed from reality as to render the results meaningless.
6. These 148 articles were therefore excluded from further consideration. As outlined in Gottfredson (1977), however, little response bias resulted.
7. This procedure can create another problem, however, as correlations based on these means can be inflated. To investigate the extent of this inflation, the coefficients (described earlier) based on the simple sums for those responding to all items were compared with these same coefficients based on the means. The largest difference observed was .03. That there is little change is a reflection of (a) the small overall amount of missing information, and (b) interscale homogeneity of items.
8. Interestingly, the transformation had little effect; the largest difference between coefficients based on raw scores and those based on the transformed data was .06. In part, this simply reflects the lack of correlation. Were these relations higher (and linear), we would expect more of an effect.
REFERENCE NOTE
1. Geller, N. L., DeCani, J., & Davies, R. Lifetime citation rates as a basis for assessing the quality of scientific work. Paper presented at a National Science Foundation/Institute for Scientific Information conference on the use of citation indexes in sociological research, Belmont, Maryland, April 1975.
REFERENCES
Bayer, A. E., & Folger, J. Some correlates of a citation measure of productivity in science. Sociology of Education, 1966, 39, 381-390.
Bowen, D. D., Perloff, R., & Jacoby, J. Improving manuscript evaluation procedures. American Psychologist, 1972, 27, 221-225.
Brackbill, V., & Korton, F. Journal reviewing practices: Authors' and APA members' suggestions for revision. American Psychologist, 1970, 25, 937-940.
Buss, A. R., & McDermott, J. R. Ratings of psychology journals compared to objective measures of journal impact. American Psychologist, 1976, 31, 675-678.
Chase, J. Normative criteria for scientific publication. American Sociologist, 1970, 8, 187-189.
Chubin, D. On the use of the Science Citation Index in sociology. American Sociologist, 1973, 8, 187-191.
Clark, K. E. America's psychologists: A survey of a growing profession. Washington, D.C.: American Psychological Association, 1957.
Cole, J., & Cole, S. Measuring the quality of sociological research: Problems in the use of the Science Citation Index. American Sociologist, 1971, 6, 23-29.
Cole, S., & Cole, J. Scientific output and recognition: A study in the operation of the reward system in science. American Sociological Review, 1967, 32, 377-390.
Crane, D. Scientists at major and minor universities: A study of productivity and recognition. American Sociological Review, 1965, 30, 699-714.
Dennis, W. Productivity among American psychologists. American Psychologist, 1954, 9, 191-194.
Diederich, G. W., Messick, S. J., & Tucker, L. R. A general least squares solution for successive intervals. Psychometrika, 1957, 22, 159-173.
Frantz, T. T. Criteria for publishable manuscripts. Personnel and Guidance Journal, 1968, 47, 384-386.
Garfield, E. Citation index for studying science. Nature, 1970, 227, 669-671.
Garfield, E. Citation analysis as a tool in journal evaluation. Science, 1972, 178, 471-479.
Garfield, E., & Sher, I. H. New factors in the evaluation of scientific literature through citation indexing. American Documentation, July 1963, pp. 195-201.
Garvey, W. D., & Gottfredson, S. D. Changing the system: Innovations in the interactive social system of scientific communication. Journal of Information Processing and Management, 1976, 12, 165-176.
Garvey, W. D., Lin, N., & Nelson, C. E. Communication in the physical and social sciences: The process of disseminating and assimilating information differs in these two groups of sciences. Science, 1970, 170, 1166-1173.
Gottfredson, S. D. Scientific quality and peer-group consensus (Doctoral dissertation, Johns Hopkins University, 1977). Dissertation Abstracts International, 1977, 38, 1950B. (University Microfilms No. 77-19,588)
Gottfredson, S. D., Garvey, W. D., & Goodnow, J. E., II. Quality indicators in the scientific journal article publication process. JSAS Catalog of Selected Documents in Psychology, 1977, 7, 74. (Ms. No. 1527)
Gross, P., & Gross, E. M. College libraries and chemical education. Science, 1927, 66, 385-389.
Hagstrom, W. O. Inputs, outputs, and the prestige of university science departments. Sociology of Education, 1971, 44, 375-397.
Hearings Before the Subcommittee on Domestic and International Scientific Planning and Analyses of the Committee on Science and Technology, U.S. House of Representatives. 94th Congress, 2nd Session, No. 95. Washington, D.C.: U.S. Government Printing Office, 1976.
Inhaber, H. Is there a pecking order in physics journals? Physics Today, May 1974, pp. 39-43.
Kuhn, T. The structure of scientific revolutions. Chicago: University of Chicago Press, 1962.
Lewis, L. S. On subjective and objective rankings of sociology departments. American Sociologist, 1968, 3, 129-131.
Lindsey, D. Distinction, achievement, and editorial board membership. American Psychologist, 1976, 31, 799-804.
Manis, J. G. Some academic influences on publication productivity. Social Forces, 1951, 29, 267-272.
Margolis, J. Citation indexing and evaluation of scientific papers. Science, 1967, 155, 1213-1219.
McReynolds, P. Reliability of ratings of research papers. American Psychologist, 1971, 26, 400-401.
Meadows, A. J. Communication in science. London: Butterworth, 1974.
Meltzer, B. M. The productivity of social scientists. American Journal of Sociology, 1949, 55, 25-29.
Menard, H. W. Science: Growth and change. Cambridge, Mass.: Harvard University Press, 1971.
Merton, R. K. Social theory and social structure. Glencoe, Ill.: Free Press, 1957.
Polanyi, M. The potential theory of adsorption. Science, 1963, 141, 1010-1013.
Price, D. J. deS. Networks of scientific papers. Science, 1965, 149, 655-657.
Raisig, L. M. Mathematical evaluation of the scientific serial. Science, 1960, 131, 1417-1419.
Reports of the American Psychological Association's project on scientific information exchange in psychology (Vol. 1). Washington, D.C.: American Psychological Association, 1962.
Roe, A. A psychological study of eminent biologists. Psychological Monographs, 1951, 65 (14, Whole No. 331).
Roe, A. The psychology of the scientist. Science, 1961, 134, 456-459.
Roe, A. Psychological approaches to creativity in science. In M. A. Coler (Ed.), Essays on creativity in the sciences. New York: New York University Press, 1963.
Roe, A. Changes in scientific activities with age. Science, 1965, 150, 313-318.
Scott, W. A. Inter-referee agreement on some characteristics of manuscripts submitted to the Journal of Personality and Social Psychology. American Psychologist, 1974, 29, 698-702.
Storer, N. W. The social system of science. New York: Holt, Rinehart & Winston, 1966.
Summary report of journal operations for 1975. American Psychologist, 1976, 31, 468.
Torgerson, W. S. Theory and methods of scaling. New York: Wiley, 1958.
Torgerson, W. S. Multidimensional representation of similarity structures. In M. K. Katz, J. L. Cole, & W. E. Barton (Eds.), Classification in psychiatry and psychopathology. Washington, D.C.: U.S. Government Printing Office, 1968.
Wade, N. Citation analysis: A new tool for science administrators. Science, 1975, 188, 429-432.
Westbrook, N. Identifying significant research. Science, 1960, 132, 1229-1234.
Wolff, W. M. A study of criteria for journal manuscripts. American Psychologist, 1970, 25, 636-639.
Wolff, W. M. Publication problems in psychology and an explicit evaluation schema for manuscripts. American Psychologist, 1973, 28, 257-261.
Xhigness, L. W., & Osgood, C. E. Bibliographic citation characteristics of the psychological journal network in 1950 and 1960. American Psychologist, 1967, 22, 778-791.
Zuckerman, H. Nobel laureates in science: Patterns of productivity, collaboration, and authorship. American Sociological Review, 1967, 32, 391-403.