Brookhart Et Al 2016 RER 100 Year Grades Review
Note: This document is a pre-print of this manuscript, published in the journal Review of Educational Research. Citation: Brookhart, S. M., Guskey, T. R., Bowers, A. J., McMillan, J. H., Smith, J. K., Smith, L. F., Stevens, M. T., & Welsh, M. E. (2016). A Century of Grading Research: Meaning and Value in the Most Common Educational Measure. Review of Educational Research, 86(4), 803-848. doi: 10.3102/0034654316672069, http://doi.org/10.3102/0034654316672069

ABSTRACT:
Grading refers to the symbols assigned to individual pieces of student work or to composite measures of student performance on report cards. This review of over 100 years of research on grading considers five types of studies: (a) early studies of the reliability of grades, (b) quantitative studies of the composition of K-12 report card grades, (c) survey and interview studies of teachers' perceptions of grades, (d) studies of standards-based grading, and (e) grading in higher education. Early 20th century studies generally condemned teachers' grades as unreliable. More recent studies of the relationships of grades to tested achievement and survey studies of teachers' grading practices and beliefs suggest that grades assess a multidimensional construct containing both cognitive and non-cognitive factors reflecting what teachers value in student work. Implications for future research and for grading practices are discussed.

Keywords: grading, classroom assessment, educational measurement

Grading refers to the symbols assigned to individual pieces of student work or to composite measures of student performance on student report cards. Grades or marks, as they were referred to in the first half of the 20th century, were the focus of some of the earliest educational research. Grading research history parallels the history of educational research more generally, with studies becoming both more rigorous and sophisticated over time.

Grading is important to study because of the centrality of grades in the educational experience of all students. Grades are widely perceived to be what students "earn" for their achievement (Brookhart, 1993, p. 139), and have pervasive influence on students and schooling (Pattison, Grodsky, & Muller, 2013). Furthermore, grades predict important future educational consequences, such as dropping out of school (Bowers, 2010a; Bowers & Sprott, 2012; Bowers, Sprott, & Taff, 2013), applying and being admitted to college, and college success (Atkinson & Geiser, 2009; Bowers, 2010a; Thorsen & Cliffordson, 2012). Grades are especially predictive of academic success in more open admissions higher education institutions (Sawyer, 2013).

Purpose of This Review and Research Question
This review synthesizes findings from five types of grading studies: (a) early studies of the reliability of grades on student work, (b) quantitative studies of the composition of K-12 report card grades and related educational outcomes, (c) survey and interview studies of teachers' perceptions of grades and grading practices, (d) studies of standards-based grading (SBG) and the relationship between students' report card grades and large-scale accountability assessments, and (e) grading in higher education. The central question underlying all of these studies is "What do grades mean?" In essence, this is a validity question (Kane, 2006; Messick, 1989). It concerns whether evidence supports the intended meaning and use of grades as an educational measure. To date, several reviews have given partial answers to that question, but none of these reviews synthesize 100 years of research from five types of studies. The purpose of this review is to provide a more comprehensive and complete answer to the research question "What do grades mean?"
BACKGROUND:
The earliest research on grading concerned mostly the reliability of grades teachers assigned to students' work. The earliest investigation of which the authors are aware was published in the Journal of the Royal Statistical Society. Edgeworth (1888) applied the "Theory of Errors" (p. 600) based on normal curve theory to the case of grading examinations. He described three different sources of error: (a) chance; (b) personal differences among graders regarding the whole exam (severity or leniency and speed) and individual items on the exam, now referred to as task variation; and (c) "taking his [the examinee's] answers as representative of his proficiency" (p. 614), now referred to as generalizing to the domain. In parsing these sources of error, Edgeworth went beyond simple chance variation in grades to treat grades as subject to multiple sources of variation or error. This nuanced view, which was quite advanced for its time, remains useful today. Edgeworth pointed out the educational consequences of unreliability in grading, especially in awarding diplomas, honors, and other qualifications to students. He used this point to build an argument for improving reliability. Today, the existence of unintended adverse consequences is also an argument for improving validity (Messick, 1989).

During the 19th century, student progress reports were presented to parents orally by the teacher during a visit to a student's home, with little standardization of content. Oral reports were eventually abandoned in favor of written narrative descriptions of how students were performing in certain skills like penmanship, reading, or arithmetic (Guskey & Bailey, 2001). In the 20th century, high school student populations became so diverse and subject area instruction so specific that high schools sought a way to manage the increasing demands and complexity of evaluating student progress (Guskey & Bailey, 2001). Although elementary schools maintained narrative descriptions, high schools increasingly favored percentage grades because the completion of narrative descriptions was viewed as time-consuming and lacking cost-effectiveness (Farr, 2000). One could argue that this move to percentage grades eliminated the specific communication of what students knew and could do.

Reviews by Crooks (1933), Smith and Dobbin (1960), and Kirschenbaum, Napier, and Simon (1971) debated whether grading should be norm- or criterion-referenced, based on clearly defined standards for student learning. Although high schools tended to stay with norm-referenced grades to accommodate the need for ranking students for college admissions, some elementary school educators transitioned to what was eventually called mastery learning and then standards-based education. Based on studies of grading reliability (Kelly, 1914; Rugg, 1918), in the 1920s teachers began to adopt grading systems with fewer and broader categories (e.g., the A–F scale). Still, variation in grading practices persisted. Hill (1935) found variability in the frequency of grade reports, ranging from 2–12 times per year, and a wide array of grade reporting practices. Of 443 schools studied, 8 percent employed descriptive grading, 9 percent percentage grading, 31 percent percentage-equivalent categorical grading, 54 percent categorical grading that was not percentage-equivalent, and 2 percent "gave a general rating on some basis such as 'degree to which the pupil is working to capacity'" (Hill, 1935, p. 119). By the 1940s, more than 80 percent of U.S. schools had adopted the A–F grading scale. A–F remained the most commonly used scale until the present day. Current grading reforms move in the direction of SBG, a relatively new and increasingly common practice (Grindberg, 2014) in which grades are based on standards for achievement. In SBG, work habits and other non-achievement factors are reported separately from achievement (Guskey & Bailey, 2010).

METHOD:
Literature searches for each of the five types of studies were conducted by different groups of co-authors, using the same general strategy: (a) a keyword search of electronic databases, (b) review of abstracts against criteria for the type of study, (c) a full read of studies that met criteria, and (d) a snowball search using the references from qualified studies. All searches were limited to articles published in English.

To identify studies of grading reliability, electronic searches using the terms "teachers' marks (or marking)" and "teachers' grades (or grading)" were conducted in the following databases: ERIC, the Journal of Educational Measurement (JEM), Educational Measurement: Issues and Practice (EMIP), ProQuest's Periodicals Index Online, and the Journal of Educational Research (JER). The criterion for inclusion was that the research addressed individual pieces of student work (usually examinations), not composite report card grades. Sixteen empirical studies were found (Table 1).

To identify studies of grades and related educational outcomes, search terms included "(grades OR marks) AND (model* OR relationship OR correlation OR association OR factor)." Databases searched included JSTOR, ERIC, and Educational Full Text Wilson Web. Criteria for inclusion were that the study (a) examined the relationship of K-12 grades to schooling outcomes, (b) used quantitative methods, and (c) examined data from actual student assessments rather than teacher perspectives on grading. Forty-one empirical studies were identified (Tables 2, 3, and 4).

For studies of K-12 teachers' perspectives about grading and grading practices, the search terms used were "grade(s)," "grading," and "marking" with "teacher perceptions," "teacher practices," and "teacher attitudes." Databases searched included ERIC, Education Research Complete, Dissertation Abstracts, and Google Scholar. Criteria for inclusion were that the study topic was K-12 teachers' perceptions of grading and grading practices and that the study was published since 1994 (the date of Brookhart's previous review). Thirty-five empirical studies were found (31 are presented in Table 5, and four that investigated SBG are in Table 6).
Table 1
Study | Method | Sample | Findings

Starch (1913) | Descriptive statistics | 10 instructors grading 10 freshman English exams | Teacher variability was large, and largest for the two poorest papers. Isolated four sources of variation and reported probable error (p. 632; total probable error = 5.4 out of 100): (1) differences among the standards of different schools (probable error almost 0), (2) differences among the standards of different teachers (pe = 1.0), (3) differences in the relative values placed by different teachers upon various elements in a paper, including content and form (pe = 2.1), and (4) differences due to the pure inability to distinguish between closely allied degrees of merit (pe = 2.2).

Starch (1915) | Descriptive statistics | 12 teachers grading 24 6th and 7th grade compositions | Average teacher variability of 4.2 (out of 100) was reduced to 2.8 by forcing a normal distribution using a 5-category scale (Poor, Inferior, Medium, Superior, and Excellent).

Starch and Elliott (1912) | Descriptive statistics | 142 high school English teachers grading 2 exams | Teacher variability in assigning grades was large (a range of 30-40 out of 100 points, probable errors of 4.0 and 4.8, respectively); teacher variability in the relative sense as well.

Starch and Elliott (1913a) | Descriptive statistics | 138 high school mathematics teachers grading 1 geometry exam | Teacher variability was larger than for the English papers in Starch and Elliott (1912): probable error of 7.5. The grade for 1 answer varies about as widely as the composite grade for the whole exam.

Starch and Elliott (1913b) | Descriptive statistics | 122 high school history teachers grading 1 exam | Teacher variability was larger than for the English or math exams (Starch & Elliott, 1912, 1913a): probable error of 7.7. Concluded that variability isn't due to subject, but "the examiner and method of examination" (p. 680).
The search for studies of standards-based grading used the search terms "standards" and ("grades" or "reports") and "education." Databases searched included PsycINFO, PsycARTICLES, ERIC, and Education Source. The criterion for inclusion was that articles needed to address SBG. Eight empirical studies were identified (Table 6).

For studies of grading in higher education, search terms included "grades" or "grading," combined with "university," "college," and "higher education" in the title. Databases searched included EBSCO Education Research Complete, ERIC, and ProQuest (Education Journals). The inclusion criterion was that the study investigated grading practices in higher education. University websites in 12 different countries were also consulted to allow for international comparisons. Fourteen empirical studies were found (Table 7).

RESULTS:
Summaries of results from each of the five types of studies, along with tables listing those results, are presented in this section. The Discussion section that follows synthesizes the findings and examines the meaning of grades based on that synthesis.

Grading Reliability
Table 1 displays the results of studies on the reliability of teachers' grades. The main finding was that great variation exists in the grades teachers assign to students' work (Ashbaugh, 1924; Brimi, 2011; Eells, 1930; Healy, 1935; Hulten, 1925; Kelly, 1914; Lauterbach, 1928; Rugg, 1918; Silberstein, 1922; Sims, 1933; Starch, 1913, 1915; Starch & Elliott, 1912, 1913a, b). Three studies (Bolton, 1927; Jacoby, 1910; Shriner, 1930) argued against this conclusion, however, contending that teacher variability in grading was not as great as commonly suggested.

As the work of Edgeworth (1888) previewed, these studies identified several sources of the variability in grading. Starch (1913), for example, determined that three major factors produced an average probable error of 5.4 on a 100-point scale across instructors and schools. Specifically, "Differences due to the pure inability to distinguish between closely allied degrees of merit" (p. 630) contributed 2.2 points, "Differences in the relative values placed by different teachers upon various elements in a paper, including content and form" (p. 630) contributed 2.1 points, and "Differences among the standards of different teachers" (p. 630) contributed 1.0 point. Although investigated, "Differences among the standards of different schools" (p. 630) contributed practically nothing toward the total (p. 632).

Other studies listed in Table 1 identify these and other sources of grading variability. Differences in grading criteria, or lack of criteria, were found to be a prominent source of variability in grades (Ashbaugh, 1924; Brimi, 2011; Eells, 1930; Healy, 1935; Silberstein, 1922), akin to Starch's (1913) difference in the relative values teachers place on various elements in a paper. Teacher severity or leniency was found to be another source of variability in grades (Shriner, 1930; Silberstein, 1922; Sims, 1933), similar to Starch's differences in teachers' standards. Differences in student work quality were associated with variability in grades, but the findings were inconsistent. Bolton (1927), for example, found greater grading variability for poorer papers. Similarly, Jacoby (1910) interpreted his high agreement as a result of the high quality of the papers in his sample. Eells (1930), however, found greater grading consistency in the poorer papers. Lauterbach (1928) found more grading variability for typewritten compositions than for handwritten versions of the same work. Finally, between-teacher error was a central factor in all of the studies in Table 1. Studies by Hulten (1925) and Eells (1930) demonstrated within-teacher error, as well.

Given a probable error of around 5 on a 100-point scale, Starch (1913) recommended the use of a 9-point scale (i.e., A+, A-, B+, B-, C+, C-, D+, D-, and F) and later tested the improvement in reliability gained by moving to a 5-point scale based on the normal distribution (Starch, 1915). His and other studies contributed to the movement in the early 20th century away from a 100-point scale. The ABCDF letter grade scale became more common and remains the most prevalent grading scale in schools in the U.S. today.
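One rough way to see why a probable error of about 5 points argues for a much coarser scale (a back-of-the-envelope reading offered here, not a calculation reported in these studies): if two marks that differ by less than roughly twice the probable error (PE) cannot be reliably distinguished, then a 100-point scale supports only about

\[
\frac{100}{2 \times \mathrm{PE}} \approx \frac{100}{2 \times 5} = 10
\]

meaningfully distinct grade categories, which is on the order of the 9-point (A+ through F) scale Starch proposed.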
Grades and Related Educational Outcomes
Quantitative studies of grades and related educational outcomes moved the focus of research on grades from questions of reliability to questions of validity. Three types of studies investigated the meaning of grades in this way. The oldest line of research (Table 2) looked at the relationship between grades and scores on standardized tests of intelligence or achievement. Today, those studies would be seen as seeking concurrent evidence for validity under the assumption that graded achievement should be the same as tested achievement (Brookhart, 2015). As the 20th century progressed, researchers added non-cognitive variables to these studies, describing grades as multidimensional measures of academic knowledge, engagement, and persistence (Table 3). A third group of more recent studies looked at the relationship between grades and other educational outcomes, for example dropping out of school or future success in school (Table 4).
Table 2
Studies of the Relation of K-12 Report Card Grades and Tested Achievement
Table 3
Studies of K-12 Report Card Grades as Multidimensional Measures of Academic Knowledge, Engagement, and Persistence
Study | Method | Sample | Findings

Miner (1967) | Factor analysis | 671 high school students | Examined academic grades in first, third, sixth, ninth, and twelfth grade; achievement tests in fifth, sixth, and ninth grades; and citizenship grades in first, third, and sixth grades. A three-factor solution was identified: (a) objective achievement, (b) a behavior factor, and (c) high school achievement as measured through grades.

Sobel (1936) | Descriptive | Not reported | Students categorized into three groups based on comparing grades and achievement test levels: grade-superior, middle-group, and mark-superior.

Thorsen and Cliffordson (2012) | Structural equation modeling | All grade 9 students in Sweden: 99,085 (2003), 105,697 (2004), 108,753 (2005) | Generally replicated Klapp Lekholm and Cliffordson (2009).

Thorsen (2014) | Structural equation modeling | 3,855 students in Sweden | Generally replicated Klapp Lekholm and Cliffordson (2009) in examining norm-referenced grades.

Willingham, Pollack, and Lewis (2002) | Regression | 8,454 students from 581 schools | A moderate relationship between grades and tests was identified, as well as strong positive relationships between grades and student motivation, engagement, completion of work assigned, and persistence.
Table 4
Studies of Grades as Predictors of Educational Outcomes
Study | Method | Sample | Findings

Cairns, Cairns, and Neckerman (1989) | Cluster analysis; regression | 475 grade 7 students | Beyond student demographics, student aggressiveness and low levels of academic performance were associated with dropping out.

Cliffordson (2008) | Two-level modeling | 164,106 Swedish students | Grades predict achievement in higher education more strongly than the SweSAT (Swedish Scholastic Aptitude Test), and criterion-referenced grades predict slightly better than norm-referenced grades.

Ekstrom, Goertz, Pollack, and Rock (1986) | Regression | High School and Beyond survey, 30,000 high school sophomores | Grades and problem behavior were identified as the most important variables for identifying dropping out, higher than test scores.

Ensminger and Slusarcick (1992) | Regression | 1,242 first graders from a historically disadvantaged community | Low grades and aggressive behavior were related to eventually dropping out, with low SES negatively moderating the relationships.

Fitzsimmons, Cheever, Leonard, and Macunovich (1969) | Correlation | 270 high school students | Students receiving low grades (D or F) in elementary or middle school were at much higher risk of dropping out.

Jimerson, Egeland, Sroufe, and Carlson (2000) | Regression | 177 children tracked from birth through age 19 | Home environment, quality of parent caregiving, academic achievement, student problem behaviors, peer competence, and intelligence test scores were significantly related to dropping out.

Lloyd (1978) | Regression | 1,532 third grade students | Dropping out was significantly predicted by grades and marks.

Morris, Ehren, and Lenz (1991) | Correlation; chi-square | 785 students in grades 7 through 12 | Dropping out was predicted by absences, low grades (D or F), and mobility.

Roderick and Camburn (1999) | Regression | 27,612 Chicago ninth graders | Examined significant predictors of course failure, including low attendance, and found failure rates varied significantly at the school level.

Troob (1985) | Descriptive | 21,000 New York City high school students | Low grades and high absences corresponded to higher levels of dropping out.
These studies offer predictive evidence for validity under the assumption that grades measure school success.

Correlation of grades and other assessments. Table 2 describes studies that investigated the relationship between grades (usually grade-point average, GPA) and standardized test scores in an effort to understand the composition of the grades and marks that teachers assign to K-12 students. Despite the enduring perception that the correlation between grades and standardized test scores is strong (Allen, 2005; Duckworth, Quinn, & Tsukayama, 2012; Stanley & Baines, 2004), this correlation is and always has been relatively modest, in the .5 range. As Willingham, Pollack, and Lewis (2002) noted:

Understanding these characteristics of grades is important for the valid use of test scores as well as grade averages because, in practice, the two measures are often intimately connected… [there is a] tendency to assume that a grade average and a test score are, in some sense, mutual surrogates; that is, measuring much the same thing, even in the face of obvious differences (p. 2).

Research on the relationship between grades and standardized assessment results is marked by two major eras: early 20th century studies and late 20th into 21st century studies. Unzicker (1925) found that average grades across subjects correlated .47 with intelligence test scores. Ross and Hooks (1930) reviewed 20 studies conducted from 1920 through 1929 on report card grades and intelligence test scores in elementary school as predictors of junior high and high school grades. Results showed that the correlations between grades in seventh grade and intelligence test scores ranged from .38 to .44. Ross and Hooks concluded:

Data from this and other studies indicate that the grade school record affords a more reliable or consistent basis of prediction than any other available, the correlations in three widely-scattered school systems showing remarkable stability; and that without question the grade school record of the pupil is the most usable or practical of all bases for prediction, being available wherever cumulative records are kept, without cost and with a minimum expenditure of time and effort (p. 195).

Subsequent studies moved from correlating grades and intelligence test scores to correlating grades with standardized achievement results (Carter, 1952, r = .52; Moore, 1939, r = .61). McCandless, Roberts, and Starnes (1972) found a smaller correlation (r = .31) after accounting for socio-economic status, ethnicity, and gender. Although the sample selection procedures and methods used in these early investigations are problematic by current standards, they represent a clear desire on the part of researchers to understand what teacher-assigned grades represent in comparison to other known standardized assessments. In other words, their focus was criterion validity (Ross & Hooks, 1930).

Investigations from the late 20th century and into the 21st century replicated earlier studies but included larger, more representative samples and used more current standardized tests and methods (Brennan, Kim, Wenz-Gross, & Siperstein, 2001; Woodruff & Ziomek, 2004). Brennan and colleagues (2001), for example, compared reading scores from the Massachusetts MCAS state test to grades in mathematics, English, and science and found correlations ranging from .54 to .59. Similarly, using GPA and 2003 TerraNova Second Edition/California Achievement Tests, Duckworth and Seligman (2006) found a correlation of .66. Subsequently, Duckworth et al. (2012) compared standardized reading and mathematics test scores to GPA and found correlations between .62 and .66.

Woodruff and Ziomek (2004) compared GPA and ACT composite scores for all high school students who took the ACT college entrance exam between 1991 and 2003. They found moderate but consistent correlations ranging from .56 to .58 over the years for average GPA and composite ACT scores, from .54 to .57 for mathematics grades and ACT scores, and from .45 to .50 in English. Student GPAs were self-reported, however. Pattison and colleagues (2013) examined four decades of achievement data on tens of thousands of students using national databases to compare high school GPA to reading and mathematics standardized tests. The authors found GPA correlations consistent with past research, ranging from .52 to .64 in mathematics and from .46 to .54 in reading comprehension.

Although some variability exists across years and subjects, correlations have remained moderate but remarkably consistent in studies based on large, nationally representative datasets. Across 100 years of research, teacher-assigned grades typically correlate about .5 with standardized measures of achievement. In other words, 25 percent of the variation in grades teachers assign is attributable to a trait comparable to the trait measured by standardized tests (Bowers, 2011). The remaining 75 percent is attributable to something else. As Swineford (1947) noted in a study on grading in middle and high school, "the data [in the study] clearly show that marks assigned by teachers in this school are reliable measures of something but there is apparently a lack of agreement on just what that something should be" (p. 47) [author's emphasis]. A correlation of .5 is neither very weak—countering arguments that grades are completely subjective measures of academic knowledge; nor is it very strong—refuting arguments that grades are a strong measure of fundamental academic knowledge, and it remains consistent despite large shifts in the educational system, especially in relation to accountability and standardized testing (Bowers, 2011; Linn, 1982).
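The 25/75 split above is simply the variance-explained reading of a correlation coefficient (a worked step added here for clarity, not an additional result from the cited studies):

\[
r \approx .50 \quad\Rightarrow\quad r^{2} = (.50)^{2} = .25, \qquad 1 - r^{2} = .75,
\]

so about 25 percent of the variance in grades is shared with tested achievement and the remaining 75 percent reflects something else (Bowers, 2011).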
Grades as multidimensional measures of academic knowledge, engagement, and persistence. Investigations of the composition of K-12 report card grades consistently find them to be multidimensional, comprising minimally academic knowledge, substantive engagement, and persistence. Table 3 presents studies of grades and other measures, including many non-cognitive variables. In the earliest study of this type, Sobel (1936) found that students with high grades and low test scores had outstanding penmanship, attendance, punctuality, and effort marks, and their teachers rated them high in industry, perseverance, dependability, co-operation, and ambition. Similarly, Miner (1967) factor analyzed longitudinal data for a sample of students, including their grades in first, third, sixth, ninth, and twelfth grade; achievement tests in fifth, sixth, and ninth grades; and citizenship grades in first, third, and sixth grades. She identified a three-factor solution: (a) objective achievement as measured through standardized assessments, (b) early classroom citizenship (a behavior factor), and (c) high school achievement as measured through grades, demonstrating that behavior and two types of achievement could be identified as separate factors.

Farkas, Grobe, Sheehan, and Shaun (1990) showed that student work habits were the strongest non-cognitive predictors of grades. They noted: "Most striking is the powerful effect of student work habits upon course grades… teacher judgments of student non-cognitive characteristics are powerful determinants of course grades, even when student cognitive performance is controlled" (p. 140). Likewise, Willingham et al. (2002), using large national databases, found a moderate relationship between grades and tests as well as strong positive relationships between grades and student motivation, engagement, completion of work assigned, and persistence. Relying on a theory of a conative factor of schooling—focusing on student interest, volition, and self-regulation (Snow, 1989)—the authors suggested that grades provide a useful assessment of both conative and cognitive student factors (Willingham et al., 2002).

Kelly (2008) countered a criticism of the conative factor theory of grades, namely that teachers may award grades based on students appearing engaged and going through the motions (i.e., a procedural form of engagement) as opposed to more substantive engagement involving legitimate effort and participation that leads to increased learning. He found positive and significant effects of students' substantive engagement on subsequent grades but no relationship with procedural engagement, noting "This finding suggests that most teachers successfully use grades to reward achievement-oriented behavior and promote a widespread growth in achievement" (Kelly, 2008, p. 45). Kelly also argued that misperceptions that teachers do not distinguish between apparent and substantive engagement lend mistaken support to the use of high-stakes tests as inherently more "objective" (p. 46) than teacher assessments.

Recent studies have expanded on this work, applying sophisticated methodologies. Bowers (2009, 2011) used multidimensional scaling to examine the relationship between grades and standardized test scores in each semester in high school, in both core subjects (mathematics, English, science, and social studies) and non-core subjects (foreign/non-English languages, art, and physical education). Bowers (2011) found evidence for a three-factor structure: (a) a cognitive factor that describes the relationship between tests and core subject grades, (b) a conative and engagement factor between core subject grades and non-core subject grades (termed a "Success at School Factor, SSF," p. 154), and (c) a factor that described the difference between grades in art and physical education. He also showed that teachers' assessment of students' ability to negotiate the social processes of schooling represents much of the variance in grades that is unrelated to test scores. This points to the importance of substantive engagement and persistence (Kelly, 2008; Willingham et al., 2002) as factors that help students in both core and non-core subjects. Subsequently, Duckworth et al. (2012) used structural equation modeling (SEM) with 510 New York City fifth through eighth graders to show that engagement and persistence are mediated through teacher evaluations of student conduct and homework completion.

Casillas and colleagues (2012) examined the interrelationship among grades, standardized assessment scores, and a range of psychosocial characteristics and behavior. Twenty-five percent of the explained variance in GPAs was attributable to the standardized assessments; the rest was predicted by a combination of prior grades (30%), psychosocial factors (23%), behavioral indicators (10%), demographics (9%), and school factors (3%). Academic discipline and commitment to school (i.e., the degree to which the student is hard working, conscientious, and effortful) had the strongest relationship to GPA.

A set of recent studies focused on the Swedish national context (Cliffordson, 2008; Klapp Lekholm, 2011; Klapp Lekholm & Cliffordson, 2008, 2009; Thorsen, 2014; Thorsen & Cliffordson, 2012), which is interesting because report cards are uniform throughout the country and require teachers to grade students using the same performance level scoring system used by the national exam. Klapp Lekholm and Cliffordson (2008) showed that grades consisted of two major factors: a cognitive achievement factor and a non-cognitive "common grade dimension" (p. 188).
In a follow-up study, Klapp Lekholm and Cliffordson (2009) reanalyzed the same data, examining the relationships between multiple student and school characteristics and both the cognitive and non-cognitive achievement factors. For the cognitive achievement factor of grades, student self-perception of competence, self-efficacy, coping strategies, and subject-specific interest were most important. In contrast, the most important student variables for the non-cognitive factor were motivation and a general interest in school. These SEM results were replicated across three full population-level cohorts in Sweden representing all 99,085 9th grade students in 2003, 105,697 students in 2004, and 108,753 in 2005 (Thorsen & Cliffordson, 2012), as well as in comparison to both norm-referenced and criterion-referenced grading systems, examining 3,855 students in Sweden (Thorsen, 2014). Klapp Lekholm and Cliffordson (2009) wrote:

The relation between general interest or motivation and the common grade dimension seems to recognize that students who are motivated often possess both specific and general goals and approach new phenomena with the goal of understanding them, which is a student characteristic awarded in grades (p. 19).

These findings, similar to those of Kelly (2008), Bowers (2009, 2011), and Casillas et al. (2012), support the idea that substantive engagement is an important component of grades that is distinct from the skills measured by standardized tests. A validity argument that expects grades and standardized tests to correlate highly therefore may not be sound, because the construct of school achievement is not fully defined by standardized test scores. Tested achievement represents one dimension of the results of schooling, privileging "individual cognition, pure mentation, symbol manipulation, and generalized learning" (Resnick, 1987, pp. 13-15).

Grades as predictors of educational outcomes. Table 4 presents studies of grades as predictors of educational outcomes. Teacher-assigned grades are well-known to predict graduation from high school (Bowers, 2014), as well as transition from high school to college (Atkinson & Geiser, 2009; Cliffordson, 2008). Satisfactory grades historically have been used as one of the means to grant students a high school diploma (Rumberger, 2011). Studies from the second half of the 20th century and into the 21st century, however, have focused on using grades from early grade levels to predict student graduation rate or risk of dropping out of school (Gleason & Dynarski, 2002; Pallas, 1989).

Early studies in this domain (Fitzsimmons, Cheever, Leonard, & Macunovich, 1969; Lloyd, 1974, 1978; Voss, Wendling, & Elliott, 1966) identified teacher-assigned grades as one of the strongest predictors of student risk for failing to graduate from high school. Subsequent studies included other variables such as absence and misbehavior and found that grades remained a strong predictor (Barrington & Hendricks, 1989; Cairns, Cairns, & Neckerman, 1989; Ekstrom, Goertz, Pollack, & Rock, 1986; Ensminger & Slusarcick, 1992; Finn, 1989; Hargis, 1990; Morris, Ehren, & Lenz, 1991; Rumberger, 1987; Troob, 1985). More recent research using a life course perspective showed that low or failing grades have a cumulative effect over a student's time in school and contribute to the eventual decision to leave (Alexander, Entwisle, & Kabbani, 2001; Jimerson, Egeland, Sroufe, & Carlson, 2000; Pallas, 2003; Roderick & Camburn, 1999).

Other research in this area considered grades in two ways: the influence of low grades (Ds and Fs) on dropping out, and the relationship of a continuous scale of grades (such as GPA) to at-risk status and eventual graduation or dropping out. Three examples are particularly notable. Allensworth and colleagues have shown that failing a core subject in ninth grade is highly correlated with dropping out of school, and thus places a student off track for graduation (Allensworth, 2013; Allensworth & Easton, 2005, 2007). Such failure also compromises the transition from middle school to high school (Allensworth, Gwynne, Moore, & de la Torre, 2014). Balfanz, Herzog, and MacIver (2007) showed a strong relationship between failing core courses in sixth grade and dropping out. Focusing on modeling conditional risk, Bowers (2010b) found the strongest predictor of dropping out after grade retention was having D and F grades.

Few studies, however, have focused on grades as the sole predictor of graduation or dropping out. Most studies instead examine longitudinal grade patterns, using either data mining techniques such as cluster analysis of all course grades K-12 (Bowers, 2010a) or mixture modeling techniques to identify growth patterns or decline in GPA in early high school (Bowers & Sprott, 2012). A recent review of the studies on the accuracy of dropout predictors showed that along with the Allensworth Chicago on-track indicator (Allensworth & Easton, 2007), longitudinal GPA trajectories were among the most accurate predictors identified (Bowers et al., 2013).

Teachers' Perceptions of Grading and Grading Practices
Systematic investigations of teachers' grading practices and perceptions about grading began to be published in the 1980s and were summarized in Brookhart's (1994) review of 19 empirical studies of teachers' grading practices, opinions, and beliefs. Five themes were supported. First, teachers use measures of achievement, primarily tests, as major determinants of grades.
Second, teachers believe it is important to grade fairly. Views of fairness included using multiple sources of information, incorporating effort, and making it clear to students what is assessed and how they will be graded. This suggests teachers consider school achievement to include the work students do in school, not just the final outcome. Third, in 12 of the studies teachers included non-cognitive factors in grades, including ability, effort, improvement, completion of work, and, to a small extent, other student behaviors. Fourth, grading practices are not consistent across teachers, either with respect to purpose or the extent to which non-cognitive factors are considered, reflecting differences in teachers' beliefs and values. Finally, grading practices vary by grade level. Secondary teachers emphasize achievement products, such as tests, whereas elementary teachers use informal evidence of learning along with achievement and performance assessments. Brookhart's (1994) review demonstrated an upswing in interest in investigating grading practices during this period, in which performance-based and portfolio classroom assessment was emphasized and reports of the unreliability of teachers' subjective judgments about student work also increased. The findings were in accord with policy-makers' increasing distrust of teachers' judgments about student achievement.

Teachers' reported grading practices. Empirical studies of teachers' grading practices over the past twenty years have mainly used surveys to document how teachers use both cognitive and non-cognitive evidence, primarily effort, and their own professional judgment in determining grades. Table 5 shows that most studies published since Brookhart's 1994 review document that teachers in different subjects and grade levels use "hodgepodge" grading (Brookhart, 1991, p. 36), combining achievement, effort, behavior, improvement, and attitudes (Adrian, 2012; Bailey, 2012; Cizek, Fitzgerald, & Rachor, 1995; Cross & Frary, 1999; Duncan & Noonan, 2007; Frary, Cross, & Weber, 1993; Grimes, 2010; Guskey, 2002, 2009b; Imperial, 2011; Liu, 2008a; Llosa, 2008; McMillan, 2001; McMillan & Lawson, 2001; McMillan, Myran, & Workman, 2002; McMillan & Nash, 2000; Randall & Engelhard, 2009, 2010; Russell & Austin, 2010; Sun & Cheng, 2013; Svennberg, Meckbach, & Redelius, 2014; Troug & Friedman, 1996; Yesbeck, 2011). Teachers often make grading decisions with little school or district guidance.

Teachers distinguish among non-achievement factors in grading. They view "academic enablers" (McMillan, 2001, p. 25), including effort, ability, work habits, attention, and participation, differently from other non-achievement factors, such as student personality and behavior. McMillan, consistent with earlier research, found that academic performance and academic enablers were by far the most important in determining grades. These findings have been replicated (Duncan & Noonan, 2007; McMillan et al., 2002). In a qualitative study, McMillan and Nash (2000) found that teaching philosophy and judgments about what is best for students' motivation and learning contribute to variability in grading practices, suggesting that an emphasis on effort, in particular, influences these outcomes. Randall and Engelhard (2010) found that teacher beliefs about what best supports students are important factors in grading, especially using non-cognitive factors for borderline grades, as Sun and Cheng (2013) also found with a sample of Chinese secondary teachers. These studies suggest that part of the reason for the multidimensional nature of grading reported in the previous section is that teachers' conceptions of "academic achievement" include behavior that supports and promotes academic achievement, and that teachers evaluate these behaviors as well as academic content in determining grades. These studies also showed significant variation among teachers within the same school. That is, the weight that different teachers give to separate factors can vary a great deal within a single elementary or secondary school (Cizek et al., 1995; Cross & Frary, 1999; Duncan & Noonan, 2007; Guskey, 2009b; Troug & Friedman, 1996; U.S. Department of Education, 1999; Webster, 2011).

Teacher perceptions about grading. Compared to the number of studies about teachers' grading practices, relatively few studies focus directly on perceptual constructs such as importance, meaning, value, attitudes, and beliefs. Several studies used Brookhart's (1994) suggestion that Messick's (1989) construct validity framework is a reasonable approach for investigating perceptions. This focuses on both the interpretation of the construct (what grading means) and the implications and consequences of grading (the effect it has on students). Sun and Cheng (2013) used this conceptual framework to analyze teachers' comments about their grading and the extent to which values and consequences were considered. The results showed that teachers interpreted good grades as a reward for accomplished work, based on both effort and quality, student attitude toward achievement as reflected by homework completion, and progress in learning. Teachers indicated the need for fairness and accuracy, not just accomplishment, saying that grades are fairer if they are lowered for lack of effort or participation, and that grading needs to be strict for high achievers. Teachers also considered consequences of grading decisions for students' future success and feelings of competence.

Fairness in an individual sense is a theme in several studies of teacher perceptions of grades (Bonner & Chen, 2009; Grimes, 2010; Hay & MacDonald, 2008; Kunnath, 2016; Sun & Cheng, 2013; Svennberg et al., 2014; Tierney, Simon, & Charland, 2011). Teachers perceive grades to have value according to what they can do for individual students.
Table 5
Study | Method | Sample | Findings

Frary, Cross, and Weber (1993) | Survey; descriptive | 536 secondary teachers | Up to 70% of teachers agreed that ability, effort, and improvement should be used for grading.

Grimes (2010) | Survey; descriptive | 199 middle school teachers | Grades should be based on both achievement and non-achievement factors, including improvement, mastery, and effort.

Guskey (2002) | Survey; descriptive | 94 elementary and 112 secondary teachers | 70% of teachers reported an ideal grade distribution of 41% As, 29% Bs, and 19% Cs, but with significant variation. Teachers wanted students to obtain the highest grade possible. The highest ranked purpose was to communicate to parents, then to use as feedback to students. Multiple factors were used to determine grades, including homework, effort, and progress.

Guskey (2009b) | Survey; descriptive | 513 elementary and secondary teachers | Significant variation in grading practices and issues were reported. Most agreed learning occurs without grading. 50% averaged multiple scores to determine grades. 73% based grades on criteria, not norms. Grades were used for communication with students and parents.

Hay and MacDonald (2008) | Interviews and observations | Two high school teachers | Teachers' values and experience influenced internalization of criteria important for grading, resulting in varied practices.

Imperial (2011) | Survey; descriptive | 411 high school teachers | Teachers reported a wide variety of grading practices; whereas the primary purpose was to indicate achievement, about half used non-cognitive factors. Grading was unrelated to training received in recommended grading practices.

Kunnath (2016) | Mixed methods | 251 high school teachers | Teachers used both objective achievement results and subjective factors in grading. Teachers incorporated individual circumstances to promote the highest grades possible. Grading was based on teachers' philosophy of teaching.
Liu (2008a) | Survey; multivariate analyses | 52 middle and 55 high school teachers | Most teachers used effort, ability, and attendance/participation in grading, with few differences between grade levels: 40% used classroom behavior, 90% used effort, 65% used ability, and 75% used attendance/participation.

Liu (2008b) | Survey; factor analysis | 300 middle and high school teachers | Six components in grading were confirmed: importance/value; feedback for motivation, instruction, and improvement; effort/participation; ability and problem solving; comparisons/extra credit; and grading self-efficacy/ease/confidence/accuracy.

Llosa (2008) | Survey; factor analysis; verbal protocol analysis | 1,224 elementary teachers | While showing variations in interpreting English proficiency standards, teachers' grading supported valid summative judgments though weak formative use for improving instruction. Teachers incorporated student personality and behavior in grading.

McMillan (2001) | Survey; descriptive; factor analysis | 1,483 middle and high school teachers | Significant variation in weight given to different factors, with a high percentage of teachers using non-cognitive factors. Four components of grading were identified: academic enabling non-cognitive factors, achievement, external comparisons, and use of extra credit, with significant variation among teachers.

McMillan and Lawson (2001) | Survey; descriptive | 213 secondary science teachers | Teachers reported use of both cognitive and non-cognitive factors in grading, especially effort.

McMillan, Myran, and Workman (2002) | Survey; factor analysis | 901 elementary school teachers | Five components were confirmed, including academic enablers such as improvement and effort, extra credit, achievement, homework, and external comparisons. 70% indicated use of effort, improvement, and ability. No differences between math and language arts teachers. High variability in how much different factors are weighted.

McMillan and Nash (2000) | Interviews | 24 elementary and secondary math and English teachers | Found that teaching philosophy and student effort that improves motivation and learning were very important considerations for grading.

Randall and Engelhard (2009) | Survey; scenarios; descriptive; Rasch modeling | 800 elementary, 800 middle, and 800 high school teachers | Achievement was the most important factor; effort and behavior provided as feedback; little emphasis on ability.

Randall and Engelhard (2010) | Survey; scenarios; descriptive | 79 elementary, 155 middle, and 108 high school teachers | Achievement was the most important factor; use of effort and classroom behavior for borderline cases.

Russell and Austin (2010) | Survey; descriptive | 352 secondary music teachers | Non-cognitive factors, such as performance/skill, attendance/participation, attitude, and practice/effort, weighted as much as or more than achievement. In high school there was a greater emphasis on attendance; in middle school, more on practice.

Simon, Tierney, Forgette-Giroux, Charland, Noonan, and Duncan (2010) | Case study | One high school math teacher | Found standardized grading policies conflicted with professional judgments.

Sun and Cheng (2013) | Survey; scenarios; descriptive | 350 English language secondary teachers | Found emphasis on individualized use of grades for motivation and extensive use of non-cognitive factors and fairness, especially for borderline grades and for encouragement and effort attributions to benefit students. Teachers placed more emphasis on non-achievement factors, such as effort, homework, and study habits, than on achievement.

Svennberg, Meckbach, and Redelius (2014) | Interviews | Four physical education teachers | Identified knowledge/skills, motivation, confidence, and interaction with others as important factors.

Tierney, Simon, and Charland (2011) | Mixed methods | 77 high school math teachers | Most teachers believed in fair grading practices that stressed improvement, with little emphasis on attitude, motivation, or participation, with differences individualized to students. Effort was considered for borderline grades.

Troug and Friedman (1996) | Mixed methods | 53 high school teachers | Found significant variability in grading practices and use of both achievement and non-achievement factors.

Webster (2011) | Mixed methods | 42 high school teachers | Teachers reported multiple purposes and inconsistent practices while showing a clear desire to focus most on achievement consistent with standards.

Wiley (2011) | Survey; scenarios; descriptive | 15 high school teachers | Teachers varied in how much non-achievement factors were used for grading. Found greater emphasis on non-achievement factors, especially effort, for low ability or low achieving students.

Yesbeck (2011) | Interviews | 10 middle school language arts teachers | Found that a multitude of both achievement and non-achievement factors were included in grading.
Many teachers use their understanding of individual student circumstances, their instructional experience, and perceptions of equity, consistency, accuracy, and fairness to make professional judgments, instead of solely relying on a grading algorithm. This suggests that grading practices may vary within a single classroom, just as they do between teachers, and that this is valued, at least by some teachers, as a needed element of accurate, fair grading, not a problem. In contrast, Simon et al. (2010) reported in a case study of one high school mathematics teacher in Canada that standardized grading policy often conflicted with professional judgment and had a significant impact on determining students' final grades. This reflects the impact of policy in that country, an important contextual influence.

Some researchers (Liu, 2008b; Liu, O'Connell, & McCoach, 2006; Wiley, 2011) have developed scales to assess teachers' beliefs and attitudes about grading, including items that load on importance, usefulness, effort, ability, grading habits, and perceived self-efficacy of the grading process. These studies have corroborated the survey and interview findings about teachers' beliefs in using both cognitive and non-cognitive factors in grading.

Guskey (2009b) found differences between elementary and secondary teachers in their perspectives about the purposes of grading. Elementary teachers were more likely to view grading as a process of communication with students and parents and to differentiate grades for individual students. Secondary teachers believed that grading served a classroom control and management function, emphasizing student behavior and completion of work.

In short, findings from the limited number of studies on teacher perceptions of grading are largely consistent with findings from grading practice surveys. Some studies have successfully explored the basis for practices and show that teachers view grading as a means to have fair, individualized, positive impacts on students' learning and motivation, and to a lesser extent, classroom control. Together, the research on grading practices and perceptions suggests the following four clear and enduring findings. First, teachers idiosyncratically use a multitude of achievement and non-achievement factors in their grading practices to improve learning and motivation as well as to document academic performance. Second, student effort is a key element in grading. Third, teachers advocate for students by helping them achieve high grades. Finally, teacher judgment is an essential part of fair and accurate grading.

Standards-Based Grading
SBG recommendations emphasize communicating student progress in relation to grade-level standards (e.g., adding fractions, computing area) that describe performance using ordered categories (e.g., below basic, basic, proficient, advanced), and involve separate reporting of work habits and behavior (Brookhart, 2011; Guskey, 2009a; Guskey & Bailey, 2001, 2010; Marzano & Heflebower, 2011; McMillan, 2009; Melograno, 2007; Mohnsen, 2013; O'Connor, 2009; Scriffiny, 2008; Shippy, Washer, & Perrin, 2013; Wiggins, 1994). SBG is differentiated from standardized grading, which provides teachers with uniform grading procedures in an attempt to improve consistency in grading methods, and from mastery grading, which expresses student performance on a variety of skills using a binary mastered/not mastered scale (Guskey & Bailey, 2001). Some also assert that SBG can provide exceptionally high-quality information to parents, teachers, and students, and that SBG therefore has the potential to bring about instructional improvements and larger educational reforms. Some urge caution, however. Cizek (2000), for example, warned that SBG may be no better than other reporting formats and subject to the same misinterpretations as other grading scales.

Literature on SBG implementation recommendations is extensive, but empirical studies are few. Studies of SBG to date have focused mostly on the implementation of SBG reforms and the relationship of standards-based grades to state achievement tests designed to measure the same or similar standards. One study investigated student, teacher, and parent perceptions of SBG. Table 6 presents these studies.

Implementation of SBG. Schools, districts, and teachers have experienced difficulties in implementing SBG (Clarridge & Whitaker, 1994; Cox, 2011; Hay & MacDonald, 2008; McMunn, Schenck, & McColskey, 2003; Simon et al., 2010; Tierney et al., 2011). The understanding and support of teachers, parents, and students is key to successful implementation of SBG practices, especially grading on standards and separating achievement grades from learning skills (academic enablers). Although many teachers report that they support such grading reforms, they also report using practices that mix effort, improvement, or motivation with academic achievement (Cox, 2011; Hay & MacDonald, 2008; McMunn et al., 2003). Teachers also vary in implementing SBG practices (Cox, 2011), especially in the use of common assessments, minimum grading policies, accepting work late with no penalty, and allowing students to retest and replace poor scores with retest scores.

The previous section summarized two studies of grading practices in Ontario, Canada, which adopted SBG province-wide and required teachers to grade students on specific topics within each content area using percentage grades. Simon et al. (2010) identified tensions between provincial grading policies and one teacher's practice. Tierney and colleagues (2011) found that few teachers were aware of and applying provincial SBG policies.
Table 6
were aware of and applying provincial SBG policies. This found stronger SBG-test correlations in mathematics than
is consistent with McMunn and colleagues’ (2003) in reading or writing, and grades tended to be higher than
findings, which showed that changes in grading practice do test scores, with the exception of writing scores at some
not necessarily follow after changes in grading policy. grade levels.
SBG as a communication tool. Swan, Guskey, and Jung Grading in Higher Education
(2010, 2014) found that parents, teachers, and students Grades in higher education differ markedly among
preferred SBG over traditional report cards, with teachers countries. As a case in point, four dramatic differences
considering adopting SBG having the most favorable exist between the U.S. and New Zealand. First, grading
attitudes. Teachers implementing SBG reported that it practices are much more centralized in New Zealand where
took longer to record the detailed information included in grading is fairly consistent across universities and highly
the SBG report cards but felt the additional time was consistent within universities. Second, the grading scale
worthwhile because SBGs yielded higher-quality starts with a passing score of 50 percent, and 80 percent
information. An earlier informal report by Guskey (2004) and above score an A. Third, essay testing is more
found, however, that many parents attempted to interpret prevalent in New Zealand than multiple choice testing.
nearly all labels (e.g., below basic, basic, proficient, Fourth, grade distributions are reviewed and grades of
advanced) in terms of letter grades. It may be that a individual instructors are considered each semester at
decade of increasing familiarity with SBG has changed departmental-level meetings. These are at best rarities in
perceptions of the meaning and usefulness of SBG. higher education in the U.S.
Relationship of SBGs to high-stakes test scores. One An examination of 35 country and university websites
might expect consistency between SBGs and standards- paints a broad picture of the diversity in grading practices.
based assessment scores because they purport to measure Many countries use a system like that in New Zealand, in
the same standards. Eight papers examined this which 50 or 51 is the minimal passing score, and 80 and
consistency (Howley, Kusimo, & Parrott, 1999; Klapp above (sometimes 90 and above) is considered A level
Lekholm, 2011; Klapp Lekholm & Cliffordson, 2008, performance. Many countries also offer an E grade, which
2009; Ross & Kostuch, 2011; Thorsen & Cliffordson, is sometimes a passing score and other times indicates a
2012; Welsh & D’Agostino, 2009; Welsh, D’Agostino, & failure less egregious than an F. If 50 percent is considered
Kaniskan, 2013). All yielded essentially the same results: passing, then skepticism toward multiple choice testing
SBGs and high-stakes, standards-based assessment scores (where there is often a 1 in 4 chance of a correct guess)
were only moderately related. Howley et al. (1999) found becomes understandable. In the Netherlands, a 1 (lowest)
that 50 percent of the variance in GPA could be explained to 10 (highest) system is used, with grades 1–3 and 9–10
by standards-based assessment scores, and the magnitude rarely awarded, leaving a five-point grading system for
of the relationship varied by school. Interview data most students (Nuffic, 2013). In the European Union,
revealed that even in SBG settings, some teachers still differences between countries are so substantial that the
included non-cognitive factors (e.g., attendance and European Credit Transfer and Accumulation System was
participation) in grades. This may explain the modest created (European Commission, 2009).
relationship, at least in part.
Grading in higher education varies within countries, as
Welsh and D’Agostino (2009) and Welsh et al. (2013) well. In the U.S., it is typically seen as a matter of
developed an Appraisal Scale that gauged teachers’ efforts academic freedom and not a fit subject for external
to assess and grade students on standards attainment. This intervention. Indeed, in an analysis of the American
10-item measure focused on the alignment of assessments Association of Collegiate Registrars and Admissions
with standards and on the use of a clear, standards- Officers (AACRAO) survey of grading in higher education
attainment focused grading method. They found small to in the U.S., Collins and Nickel (1974) reported “…there
moderate correlations between this measure and grade-test are as many different types of grading systems as there are
score convergence. That is, the standards-based grades of institutions” (p. 3). The 2004 version of the same survey
teachers who utilized criterion-referenced achievement suggested, however, a somewhat more settled situation in
information were more related to standards-based recent years (Brumfield, 2005). Grading in higher
assessments than were the grades of teachers who do not education shares many issues of grade meaning with the K-
follow this practice. Welsh and D’Agostino (2009) and 12 context, which have been addressed above. Two unique
Welsh et al. (2013) found that SBG-test score relationships issues for grade meaning remain: grading and student
were larger in writing and mathematics than in reading. In course evaluations, and historical changes in expected
addition, although teachers assigned lower grades than test grade distributions. Table 7 presents studies in these areas.
scores in mathematics, grades were higher than test scores
in reading and writing. Ross and Kostuch (2011) also
Table 7
Kasten and Young (1983). Design: Experimental. Sample: 77 graduate students in 5 educational administration classes. Findings: Random assignment to 3 purposes for the course evaluation (personal decision, instructor's use, or no purpose stated) yielded no significant differences in ratings.
Kulick and Wright (2008). Design: Monte Carlo simulation. Sample: Series of simulations based on 400 students. Findings: Normal distributions of test scores do not necessarily provide evidence of the efficacy of the evaluation or of the quality of the test.
Maurer (2006). Design: Experimental. Sample: 642 students in 17 (unspecified) classes taught by the same instructor. Findings: Students were randomly assigned to 3 conditions (personnel decision, course improvement, or control group) and asked for expected grades; expected grade was related to course evaluations, but the stated purpose of the evaluation was not.
Mayo (1970). Design: Survey. Sample: 3 instructors of an undergraduate introductory measurement course. Findings: In a mastery learning context, active participation with course material appeared to be superior to only doing the reading and receiving lectures.
Nicolson (1917). Design: Survey. Sample: 64 colleges approved by the Carnegie Foundation. Findings: 36 of the colleges used a 5-division marking scale for grading purposes.
Salmons (1993). Design: Non-experimental. Sample: 444 introductory psychology students from Radford University. Findings: Students were given a course evaluation prior to the first exam and again after receiving their final grades. From pre to post, students anticipating a low grade lowered their evaluation of the course and students anticipating a high grade raised their evaluation of the course.
Smith and Smith (2009). Design: Experimental. Sample: 240 introductory psychology students. Findings: Students were randomly assigned to 1 of 3 approaches to university grading: a 100-point system, a percentage system, and an open point system. Significant differences were found for motivation, confidence, and effort, but not for perceptions of achievement or accuracy.
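The Monte Carlo entry in Table 7 may be easier to picture with a small illustration. The sketch below is not Kulick and Wright's (2008) code or design; it is a minimal example under assumed parameters (400 simulated students, fixed grade quotas of 3/22/50/22/3 percent, and two hypothetical tests that differ only in measurement error), intended only to show why forcing grades onto a curve produces a tidy grade distribution regardless of how well the underlying scores measure ability.

```python
# Minimal, illustrative Monte Carlo sketch of "grading on the curve."
# Assumptions (not Kulick & Wright's, 2008, actual design): 400 students,
# fixed grade quotas of 3/22/50/22/3 percent, and two hypothetical tests
# that differ only in how much measurement error they add to true ability.
import numpy as np

rng = np.random.default_rng(1)
n_students = 400
quotas = [0.03, 0.22, 0.50, 0.22, 0.03]   # F, D, C, B, A shares of the class
cut_points = np.cumsum(quotas)[:-1]       # cumulative boundaries: .03, .25, .75, .97

def curved_grades(scores):
    """Assign 0 (F) through 4 (A) by rank so the class matches the fixed quotas."""
    percentile = scores.argsort().argsort() / (len(scores) - 1)
    return np.digitize(percentile, cut_points)

true_ability = rng.normal(70, 10, n_students)
reliable_test = true_ability + rng.normal(0, 2, n_students)     # little error
unreliable_test = true_ability + rng.normal(0, 15, n_students)  # much error

for label, scores in [("reliable", reliable_test), ("unreliable", unreliable_test)]:
    grades = curved_grades(scores)
    shares = np.bincount(grades, minlength=5) / n_students
    r = np.corrcoef(scores, true_ability)[0, 1]
    print(f"{label:>10} test: grade shares {np.round(shares, 2)}, "
          f"score-ability correlation r = {r:.2f}")
# Both tests yield roughly normal scores and identical curved grade
# distributions, even though one measures ability far less accurately.
```

Under these assumptions, both simulated tests produce normal-looking scores and the same curved grade distribution, even though one correlates with ability at roughly .98 and the other at roughly .55; this is the sense in which a normal distribution of scores is not, by itself, evidence of the quality of the evaluation.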
Grades and student course evaluations. Students in higher education routinely evaluate the quality of their course experiences and their instructors' teaching. The relationship between course grades and course evaluations has been of interest for at least 40 years (Abrami, Dickens, Perry, & Leventhal, 1980; Holmes, 1972) and is a sub-question in the general research about student evaluations of courses (e.g., Centra, 1993; Marsh, 1984, 1987; McKeachie, 1979; Spooren, Brockx, & Mortelmans, 2013). The hypothesis is straightforward: students will give higher course evaluations to faculty who are lenient graders. This grade-leniency theory (Love & Kotchen, 2010; McKenzie, 1975) has long been lamented, particularly by faculty who perceive themselves as rigorous graders and do not enjoy favorable student evaluations. This assumption is so prevalent that it is close to accepted as settled science (Ginexi, 2003; Marsh, 1987; Salmons, 1993). Ginexi posited that the relationship between anticipated grades and course evaluation ratings could be a function of cognitive dissonance (between the student's self-image and an anticipated low grade) or of revenge theory (retribution for an anticipated low grade). Although Maurer (2006) argued that revenge theory is popular among faculty receiving low course evaluations, neither his study nor an earlier study by Kasten and Young (1983) found this to be the case. These authors therefore argued for the cognitive dissonance model, in which attributing poor teaching to the perceived lack of student success is an intrapersonal face-saving device.

A critical look at the literature presents an alternative argument. First, the relationship between anticipated grades and course evaluation ratings is moderate at best. Meta-analytic work (Centra & Creech, 1976; Feldman, 1997) suggests correlations between .10 and .30, or that anticipated grades account for less than 10 percent of the variance in course evaluations. It therefore appears that anticipated grades have little influence on student evaluations. Second, the relationship between anticipated grades and course evaluations could simply reflect an honest assessment of students' opinions of instruction, which varies according to the students' experiences of the course (Smith & Smith, 2009). Students who like the instructional approach may be expected to do better than students who do not. Students exposed to exceptionally good teaching might be expected to do well in the course and to rate the instruction highly (and vice versa for poor instruction). Although face-saving or revenge might occur, a fair amount of honest and accurate appraisal of the quality of teaching might be reflected in the observed correlations.

Historical changes in expectations for grade distributions. The roots of grading in higher education can be traced back hundreds of years. In the 16th century, Cambridge University developed a three-tier grading system with 25 percent of the grades at the top, 50 percent in the middle, and 25 percent at the bottom (Winter, 1993). Working from European models, American universities invented systems for ranking and categorizing students based both on academic performance and on progress, conduct, attentiveness, interest, effort, and regular attendance at class and chapel (Cureton, 1971; Rugg, 1918; Schneider & Hutt, 2014). Grades were ubiquitous at all levels of education at the turn of the 20th century, but were idiosyncratically determined (Schneider & Hutt, 2014), as described earlier.

To resolve inconsistencies, educators turned to the new science of statistics and a concomitant passion for measuring and ranking human characteristics (Pearson, 1930). Inspired by the work of his cousin, Charles Darwin, Francis Galton pioneered the field of psychometrics, extending his efforts to rank one's fitness to produce high-quality offspring on an A to D scale (Galton & Galton, 1998). Educators began to debate how normal curve theory and other scientific advances should be applied to grading. As with K–12 education, the consensus was that the 0–100 marking system led to an unjustified implication of precision, and that the normal curve would allow for transformation of student ranks into A-F or other categories (Rugg, 1918).

Meyer (1908) argued for grade categories as follows: excellent (3 percent of students), superior (22 percent), medium (50 percent), inferior (22 percent), and failure (3 percent). He argued that a student picked at random is as likely to be of medium ability as not. Interestingly, Meyer's terms for the middle three grades (superior, medium, and inferior) are norm-referenced, whereas the two extreme grades (excellent and failure) are criterion-referenced. Roughly a decade later, Nicolson (1917) found that 36 out of 64 colleges were using a 5-point scale for grading, typically A–F. The questions debated at the time were more over the details of such systems than over the overall approach. As Rugg (1918) stated:

Now the term inherited capacity practically defines itself. By it we mean the "start in life;" the sum total of nervous possibilities which the infant has at birth and to which, therefore, nothing that the individual himself can do will contribute in any way whatsoever. (p. 706)

Rugg went on to say that educational conditions interact with inherited capacity, resulting in what he called "ability-to-do" (p. 706). He recommended basing teachers' marks on observations of students' performance that reflect those abilities and proposed that grades should form a normal distribution. That is, the normal distribution should form a basis for checking the quality of the grades that teachers assign. This approach reduces grading to determining the number of grading divisions and the number of students who should fall into each category. Thus, there is a shift from a decentralized and fundamentally haphazard approach to assigning grades to one that is based on "scientific" (p. 701) principle. Furthermore, Rugg argued that letter grades were preferable to percentage grades as they more accurately represented the level of precision that was possible.

Another interesting aspect of Rugg's (1918) and Meyer's (1908) work is the notion that grades should simply be a method of ranking students, and not necessarily used for making decisions about achievement. Although Meyer argued that three percent should fail a typical course (and he feared that people would see this as too lenient), he was less certain about what to do with the "inferior" group, stating that grades should solely represent a student's rank in the class. In hindsight, these approaches seem reductionist at best. Although the notion of grading "on the curve" remained popular through at least the early 1960s, a categorical (A-F) approach to assigning grades was implemented. This system tended to mask the curve-based thinking behind it: graders kept a close eye on the notion that not too many As nor too many Fs were handed out (Guskey, 2000; Kulick & Wright, 2008). The normal curve was the "silent partner" of the grading system.

In the U.S. in the 1960s, a confluence of technical and societal events led to dramatic changes in perspectives about grading. These were criterion-referenced testing (Glaser, 1963), mastery learning and mastery testing (Bloom, 1971; Mayo, 1970), the Civil Rights movement, and the war in Vietnam. Glaser brought forth the innovative idea that sense should be made out of test performance by "referencing" performance not to a norming group, but rather to the domain whence the test came; students' performance should not be based on the performance of their peers. The proper referent, according to Glaser, was the level of mastery of the subject matter being assessed. Working from Carroll's model of school learning (Carroll, 1963), Bloom developed the underlying argument for mastery learning theory: that achievement in any course (and by extension, the grade received) should be a function of the quality of teaching, the perseverance of the student, and the time allowed for the student to master the material (Bloom, 1971; Guskey, 1985).

It was not the case that the work of Bloom (1971) and Glaser (1963) single-handedly changed how grading took place in higher education, but ideas about teaching and learning partially inspired by this work led to a substantial rethinking of the proper aims of education. Add to this mix a national reexamination of status and equity, and the time was ripe for a humanistic and social reassessment of grading and learning in general. The final ingredient in the mix was the war in Vietnam. The U.S. had its first conscription since World War II, and as the war grew increasingly unpopular, so did the pressure on professors not to fail students and make them subject to the draft. The effect of the draft on grading practices in higher education is unmistakable (Rojstaczer & Healy, 2012). The proportion of A and B grades rose dramatically during the years of the draft; the proportion of D and F grades fell concomitantly.

Grades have risen again dramatically in the past 25 years. Rojstaczer and Healy (2012) argued that this resulted from new views of students as consumers, or even customers, and away from viewing students as needing discipline. Others have contended that faculty inflate grades to vie for good course ratings (the grade-leniency theory; Love & Kotchen, 2010). Or, perhaps students are higher-achieving than they were and deserve better grades.

Discussion: What Do Grades Mean?
This review shows that over the past 100 years teacher-assigned grades have been maligned by researchers and psychometricians alike as subjective and unreliable measures of student academic achievement (Allen, 2005; Banker, 1927; Carter, 1952; Evans, 1976; Hargis, 1990; Kirschenbaum et al., 1971; Quann, 1983; Simon & Bellanca, 1976). However, others have noted that grades are a useful indicator of numerous factors that matter to students, teachers, parents, schools, and communities (Bisesi, Farr, Greene, & Haydel, 2000; Folzer-Napier, 1976; Linn, 1982). Over the past 100 years, research has attempted to identify the different components of grades in order to inform educational decision making (Bowers, 2009; Parsons, 1959). Interestingly, although standardized assessment scores have been shown to have low criterion validity for overall schooling outcomes (e.g., high school graduation and admission to post-secondary institutions), grades consistently predict K-12 educational persistence, completion, and transition from high school to college (Atkinson & Geiser, 2009; Bowers et al., 2013).

One hundred years of quantitative studies of the composition of K-12 report card grades demonstrate that teacher-assigned grades represent both the cognitive knowledge measured in standardized assessment scores and, to a smaller extent, non-cognitive factors such as substantive engagement, persistence, and positive school behaviors (e.g., Bowers, 2009, 2011; Farkas et al., 1990; Klapp Lekholm & Cliffordson, 2008, 2009; Miner, 1967; Willingham et al., 2002). Grades are useful in predicting and identifying students who may face challenges in either the academic component of schooling or in the socio-behavioral domain (e.g., Allensworth, 2013; Allensworth & Easton, 2007; Allensworth et al., 2014; Atkinson & Geiser, 2009; Bowers, 2014).

The conclusion is that grades typically represent a mixture of multiple factors that teachers value. Teachers recognize the important role of effort in achievement and motivation (Aronson, 2008; Cizek et al., 1995; Cross & Frary, 1999; Duncan & Noonan, 2007; Guskey, 2002, 2009b; Imperial, 2011; Kelly, 2008; Liu, 2008a; McMillan, 2001; McMillan & Lawson, 2001; McMillan et al., 2002; McMillan & Nash, 2000; Randall & Engelhard, 2009, 2010; Russell & Austin, 2010; Sun & Cheng, 2013; Svennberg et al., 2014; Troug & Friedman, 1996; Yesbeck, 2011). They differentiate academic enablers (McMillan, 2001, p. 25) like effort, ability, improvement, work habits, attention, and participation, which they endorse as relevant to grading, from other student characteristics like gender, socioeconomic status, or personality, which they do not endorse as relevant to grading.

This quality of graded achievement as a multidimensional measure of success in school may be what makes grades better predictors of future success in school than tested achievement (Atkinson & Geiser, 2009; Barrington & Hendricks, 1989; Bowers, 2014; Cairns et al., 1989; Cliffordson, 2008; Ekstrom et al., 1986; Ensminger & Slusarcick, 1992; Finn, 1989; Fitzsimmons et al., 1969; Hargis, 1990; Lloyd, 1974, 1978; Morris et al., 1991; Rumberger, 1987; Troob, 1985; Voss et al., 1966), especially given known limitations of achievement testing (Nichols & Berliner, 2007; Polikoff, Porter, & Smithson, 2011). In the search for assessments of non-cognitive factors that predict educational outcomes (Heckman & Rubinstein, 2001; Levin, 2013), grades appear to be useful. Current theories postulate that both cognitive and non-cognitive skills are important to acquire and build over the course of life. Although non-cognitive skills may help students to develop cognitive skills, the reverse is not true (Cunha & Heckman, 2008).

Teachers' values are a major component in this multidimensional measure. Besides academic enablers, two other important teacher values work to make graded achievement different from tested achievement. One is the value that teachers place on being fair to students (Bonner, 2016; Bonner & Chen, 2009; Brookhart, 1994; Grimes, 2010; Hay & MacDonald, 2008; Sun & Cheng, 2013; Svennberg et al., 2014; Tierney et al., 2011). In their concept of fairness, most teachers believe that students who try should not fail, whether or not they learn. Related to this concept is teachers' wish to help all or most students be successful (Bonner, 2016; Brookhart, 1994).

Grades, therefore, must be considered multidimensional measures that reflect mostly achievement of classroom learning intentions and also, to a lesser degree, students' efforts at getting there. Grades are not unidimensional measures of pure achievement, as has been assumed in the past (e.g., Carter, 1952; McCandless et al., 1972; Moore, 1939; Ross & Hooks, 1930) or recommended in the present (e.g., Brookhart, 2009, 2011; Guskey, 2000; Guskey & Bailey, 2010; Marzano & Hefflebower, 2011; O'Connor, 2009; Scriffiny, 2008). Although measurement experts and professional developers may wish grades were unadulterated measures of what students have learned and are able to do, strong evidence indicates that they are not.

For those who wish grades could be a more focused measure of achievement of intended instructional outcomes, future research needs to cast a broader net. The value teachers attach to effort and other academic enablers in grades and their insistence that grades should be fair point to instructional and societal issues that are well beyond the scope of grading. Why, for example, do some students who sincerely try to learn what they are taught not achieve the intended learning outcomes? Two important possibilities include intended learning outcomes that are developmentally inappropriate for these students (e.g., these students lack readiness or prior instruction in the domain), and poorly designed lessons that do not make clear what students are expected to learn, do not instruct students in appropriate ways, and do not arrange learning activities and formative assessments in ways that help students learn well. Research focusing solely on grades typically misses antecedent causes. Future research should make these connections. For example, does more of the variance in grades reflect achievement in classes where lessons are high-quality and appropriate for students? Is a negatively skewed grade distribution, where most students achieve and very few fail, effective for the purposes of certifying achievement, communicating with students and parents, passing students to the next grade, or predicting future educational success? Do changes in instructional design lead to changes in grading practices, in grade distributions, and in the usefulness of grades as predictors of future educational success?

This review suggests that most teachers' grades do not yield a pure achievement measure, but rather a multidimensional measure dependent on both what the students learn and how they behave in the classroom. This conclusion, however, does not excuse low-quality grading practices or suggest there is no room for improvement. One hundred years of grading research have generally confirmed large variation among teachers in the validity and reliability of grades, both in the meaning of grades and the accuracy of reporting.

Early research found great variation among teachers when asked to grade the same examination or paper. Many of these early studies communicated a "what's wrong with teachers" undertone that today would likely be seen as researcher bias. Early researchers attributed sources of variation in teachers' grades to one or more of the following sources: criteria (Ashbaugh, 1924; Brimi, 2011; Healy, 1935; Silberstein, 1922; Sims, 1933; Starch, 1915; Starch & Elliott, 1913a,b), students' work quality (Bolton, 1927; Healy, 1935; Jacoby, 1910; Lauterbach, 1928; Shriner, 1930; Sims, 1933), teacher severity/leniency (Shriner, 1930; Silberstein, 1922; Sims, 1933; Starch, 1915; Starch & Elliott, 1913b), task (Silberstein, 1922; Starch & Elliott, 1913a), scale (Ashbaugh, 1924; Sims, 1933; Starch, 1913, 1915), and teacher error (Brimi, 2011; Eells, 1930; Hulten, 1925; Lauterbach, 1928; Silberstein, 1922; Starch & Elliott, 1912, 1913a,b). Starch (1913; Starch & Elliott, 1913b) found that teacher error and emphasizing different criteria were the two largest sources of variation.

Regarding sources of error, Smith (2003) suggested reconceptualizing reliability for grades as a matter of sufficiency of information for making the grade assignment. This recommendation is consistent with the fact that as grades are aggregated from individual pieces of work to report card or course grades and grade-point averages, reliability increases. The reliability of overall college grade-point average is estimated at .93 (Beatty, Walmsley, Sackett, Kuncel, & Koch, 2015).

In most studies investigating teachers' grading reliability, teachers were sent examination papers without specific grading criteria and simply asked to assign grades. Today, this lack of clear grading criteria would be seen as a shortcoming in the assessment process. Most of these studies thus confounded teachers' inability to judge student work consistently with random error, considering both to be teacher error. Rater training offers a modern solution to this situation. Research has shown that with training on established criteria, individuals can judge examinees' work more accurately and reliably (Myford, 2012). Unfortunately, most teachers and professors today are not well trained, typically grade alone, and rarely seek help from colleagues to check the reliability of their grading. Thus, working toward clearer criteria, collaborating among teachers, and involving students in the development of grading criteria appear to be promising approaches to enhancing grading reliability.

Considering criteria as a source of variation in teachers' grading has implications for grade meaning and validity. The attributes upon which grading decisions are based function as the constructs the grades are intended to measure. To the extent teachers include factors that do not indicate achievement in the domain they intend to measure (e.g., when grades include consideration of format and surface-level features of an assignment), grades do not give students, parents, or other educators accurate information about learning. Furthermore, to the extent teachers do not appropriately interpret student work as evidence of learning, the intended meaning of the grade is also compromised. There is evidence that even teachers who explicitly decide to grade solely on achievement of learning standards sometimes mix effort, improvement, and other academic enablers when determining grades (Cox, 2011; Hay & McDonald, 2008; McMunn et al., 2003).

Future research in this area should seek ways to help teachers improve the criteria they use to grade, their skill at identifying levels of quality on the criteria, and their ability to effectively merge these assessment skills and instructional skills. When students are taught the criteria by which to judge high-quality work and are assessed by those same criteria, grade meaning is enhanced. Even if grades remain multidimensional measures of success in school, the dimensions on which grades are based should be defensible goals of schooling and should match students' opportunities to learn.

No research agenda will ever entirely eliminate teacher variation in grading. Nevertheless, the authors of this review have suggested several ways forward. Investigating grading in the larger context of instruction and assessment will help focus research on important sources and causes of invalid or unreliable grading decisions. Investigating ways to differentiate instruction more effectively, routinely, and easily will reduce teachers' feelings of pressure to pass students who may try but do not reach an expected level of achievement. Investigating the multidimensional construct of "success in school" will acknowledge that grades measure something significant that is not measured by achievement tests. Investigating ways to help teachers develop skills in writing or selecting and then communicating criteria, and recognizing these criteria in students' work, will improve the quality of grading. All of these seem reachable goals to achieve before the next century of grading research. All will assuredly contribute to enhancing the validity, reliability, and fairness of grading.

Suggested Citation Format:
Brookhart, S. M., Guskey, T. R., Bowers, A. J., McMillan, J. H., Smith, J. K., Smith, L. F., Stevens, M. T., & Welsh, M. E. (2016). A Century of Grading Research: Meaning and Value in the Most Common Educational Measure. Review of Educational Research, 86(4), 803–848.
doi: 10.3102/0034654316672069
http://doi.org/10.3102/0034654316672069

REFERENCES:
Abrami, P. C., Dickens, W. J., Perry, R. P., & Leventhal, L. (1980). Do teacher standards for assigning grades affect student evaluations of instruction? Journal of Educational Psychology, 72, 107–118. doi:10.1037/0022-0663.72.1.107
Adrian, C. A. (2012). Implementing standards-based grading: Elementary teachers' beliefs, practices and concerns (Doctoral dissertation). Retrieved from
Journal of the Royal Statistical Society, 51, 599–635.
Eells, W. C. (1930). Reliability of repeated grading of essay type examinations. Journal of Educational Psychology, 21, 48–52.
Ekstrom, R. B., Goertz, M. E., Pollack, J. M., & Rock, D. A. (1986). Who drops out of high school and why? Findings from a national study. Teachers College Record, 87, 356–373.
Ensminger, M. E., & Slusarcick, A. L. (1992). Paths to high school graduation or dropout: A longitudinal study of a first-grade cohort. Sociology of Education, 65, 91–113. doi:10.2307/2112677
European Commission. (2009). ECTS user's guide. Luxembourg, Belgium: Office for Official Publications of the European Communities. doi:10.2766/88064
Evans, F. B. (1976). What research says about grading. In S. B. Simon & J. A. Bellanca (Eds.), Degrading the grading myths: A primer of alternatives to grades and marks (pp. 30–50). Washington, DC: Association for Supervision and Curriculum Development.
Farkas, G., Grobe, R. P., Sheehan, D., & Shuan, Y. (1990). Cultural resources and school success: Gender, ethnicity, and poverty groups within an urban school district. American Sociological Review, 55, 127–142. doi:10.2307/2095708
Farr, B. P. (2000). Grading practices: An overview of the issues. In E. Trumbull & B. Farr (Eds.), Grading and reporting student progress in an age of standards (pp. 1–22). Norwood, MA: Christopher-Gordon.
Feldman, K. A. (1997). Identifying exemplary teachers and teaching: Evidence from student ratings. In R. P. Perry & J. C. Smart (Eds.), Effective teaching in higher education: Research and practice (pp. 93–143). New York, NY: Agathon Press.
Finn, J. D. (1989). Withdrawing from school. Review of Educational Research, 59, 117–142. doi:10.3102/00346543059002117
Fitzsimmons, S. J., Cheever, J., Leonard, E., & Macunovich, D. (1969). School failures: Now and tomorrow. Developmental Psychology, 1, 134–146. doi:10.1037/h0027088
Folzer-Napier, S. (1976). Grading and young children. In S. B. Simon & J. A. Bellanca (Eds.), Degrading the grading myths: A primer of alternatives to grades and marks (pp. 23–27). Washington, DC: Association for Supervision and Curriculum Development.
Frary, R. B., Cross, L. H., & Weber, L. J. (1993). Testing and grading practices and opinions of secondary teachers of academic subjects: Implications for instruction in measurement. Educational Measurement: Issues & Practice, 12(3), 23–30. doi:10.1111/j.1745-3992.1993.tb00539.x
Galton, D. J., & Galton, C. J. (1998). Francis Galton: and eugenics today. Journal of Medical Ethics, 24, 99–105.
Ginexi, E. M. (2003). General psychology course evaluations: Differential survey response by expected grade. Teaching of Psychology, 30, 248–251.
Glaser, R. (1963). Instructional technology and the measurement of learning outcomes: Some questions. American Psychologist, 18, 519. doi:10.1111/j.1745-3992.1994.tb00561.x
Gleason, P., & Dynarski, M. (2002). Do we know whom to serve? Issues in using risk factors to identify dropouts. Journal of Education for Students Placed at Risk, 7, 25–41. doi:10.1207/S15327671ESPR0701_3
Grimes, T. V. (2010). Interpreting the meaning of grades: A descriptive analysis of middle school teachers' assessment and grading practices (Doctoral dissertation). Retrieved from ProQuest. (305268025)
Grindberg, E. (2014, April 7). Ditching letter grades for a 'window' into the classroom. Cable News Network. Retrieved from http://www.cnn.com/2014/04/07/living/report-card-changes-standards-based-grading-schools/
Guskey, T. R. (1985). Implementing mastery learning. Belmont, CA: Wadsworth.
Guskey, T. R. (2000). Grading policies that work against standards…and how to fix them. NASSP Bulletin, 84(620), 20–29. doi:10.1177/019263650008462003
Guskey, T. R. (2002, April). Perspectives on grading and reporting: Differences among teachers, students, and parents. Paper presented at the Annual Meeting of the American Educational Research Association, New Orleans, LA.
Guskey, T. R. (2004). The communication challenge of standards-based reporting. Phi Delta Kappan, 86, 326–329. doi:10.1177/003172170408600419
Guskey, T. R. (2009a). Grading policies that work against standards… And how to fix them. In T. R. Guskey (Ed.), Practical solutions for serious problems in standards-based grading (pp. 9–26). Thousand Oaks, CA: Corwin.
Guskey, T. R. (2009b, April). Bound by tradition: Teachers' views of crucial grading and reporting issues. Paper presented at the Annual Meeting of the American Educational Research Association, San Francisco, CA.
Guskey, T. R., & Bailey, J. M. (2001). Developing grading and reporting systems for student learning. Thousand Oaks, CA: Corwin.
Guskey, T. R., & Bailey, J. M. (2010). Developing standards-based report cards. Thousand Oaks, CA: Corwin.
Guskey, T. R., Swan, G. M., & Jung, L. A. (2010, April). Developing a statewide, standards-based student report card: A review of the Kentucky initiative. Paper presented at the Annual Meeting of the American Educational Research Association, Denver, CO.
Hargis, C. H. (1990). Grades and grading practices: Obstacles to improving education and helping at-risk students. Springfield, MA: Charles C. Thomas.
Hay, P. J., & Macdonald, D. (2008). (Mis)appropriations of criteria and standards-referenced assessment in a performance-based subject. Assessment in Education: Principles, Policy & Practice, 15, 153–168. doi:10.1080/09695940802164184
Healy, K. L. (1935). A study of the factors involved in the rating of pupils' compositions. Journal of Experimental Education, 4, 50–53. doi:10.1080/00220973.1935.11009995
Heckman, J. J., & Rubinstein, Y. (2001). The importance of noncognitive skills: Lessons from the GED testing program. The American Economic Review, 91, 145–149. doi:10.2307/2677749
Hill, G. (1935). The report card in present practice. Educational Method, 15, 115–131.
Holmes, D. S. (1972). Effects of grades and disconfirmed grade expectancies on students' evaluations of their instructor. Journal of Educational Psychology, 63, 130–133.
Howley, A., Kusimo, P. S., & Parrott, L. (1999). Grading and the ethos of effort. Learning Environments Research, 3, 229–246. doi:10.1023/A:1011469327430
Hulten, C. E. (1925). The personal element in teachers' marks. Journal of Educational Research, 12, 49–55. doi:10.1080/00220671.1925.10879575
Imperial, P. (2011). Grading and reporting purposes and practices in Catholic secondary schools and grades' efficacy in accurately communicating student learning (Doctoral dissertation). Retrieved from ProQuest. (896956719)
Jacoby, H. (1910). Note on the marking system in the astronomical course at Columbia College, 1909–1910. Science, 31, 819–820. doi:10.1126/science.31.804.819
Jimerson, S. R., Egeland, B., Sroufe, L. A., & Carlson, B. (2000). A prospective longitudinal study of high school dropouts examining multiple predictors across development. Journal of School Psychology, 38, 525–549. doi:10.1016/S0022-4405(00)00051-0
Kane, M. T. (2006). Validation. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 17–64). Westport, CT: American Council on Education/Praeger.
Kasten, K. L., & Young, I. P. (1983). Bias and the intended use of student evaluations of university faculty. Instructional Science, 12, 161–169. doi:10.1007/BF00122455
Kelly, F. J. (1914). Teachers' marks: Their variability and standardization (Contributions to Education No. 66). New York, NY: Teachers College, Columbia University.
Kelly, S. (2008). What types of students' effort are rewarded with high marks? Sociology of Education, 81, 32–52. doi:10.1177/003804070808100102
Kirschenbaum, H., Napier, R., & Simon, S. B. (1971). Wad-ja-get? The grading game in American education. New York, NY: Hart.
Klapp Lekholm, A. (2011). Effects of school characteristics on grades in compulsory school. Scandinavian Journal of Educational Research, 55, 587–608. doi:10.1080/00313831.2011.555923
Klapp Lekholm, A., & Cliffordson, C. (2008). Discrepancies between school grades and test scores at individual and school level: Effects of gender and family background. Educational Research and Evaluation, 14, 181–199. doi:10.1080/13803610801956663
Klapp Lekholm, A., & Cliffordson, C. (2009). Effects of student characteristics on grades in compulsory school. Educational Research and Evaluation, 15, 1–23. doi:10.1080/13803610802470425
Kulick, G., & Wright, R. (2008). The impact of grading on the curve: A simulation analysis. International Journal for the Scholarship of Teaching and Learning, 2(2), 5.
Kunnath, J. P. (2016). A critical pedagogy perspective of the impact of school poverty level on the teacher grading decision-making process (Doctoral dissertation). Retrieved from ProQuest. (10007423)
Lauterbach, C. E. (1928). Some factors affecting teachers' marks. Journal of Educational Psychology, 19, 266–271.
Levin, H. M. (2013). The utility and need for incorporating noncognitive skills into large-scale educational assessments. In M. von Davier, E. Gonzalez, I. Kirsch, & K. Yamamoto (Eds.), The role of international large-scale assessments: Perspectives from technology, economy, and educational research (pp. 67–86). Dordrecht, Netherlands: Springer.
Linn, R. L. (1982). Ability testing: Individual differences, prediction, and differential prediction. In A. K. Wigdor & W. R. Garner (Eds.), Ability testing: Uses, consequences, and controversies (pp. 335–388). Washington, DC: National Academy Press.
Liu, X. (2008a, October). Measuring teachers' perceptions of grading practices: Does school level make a difference? Paper presented at the Annual Meeting of the Northeastern Educational Research Association, Rocky Hill, CT.
Liu, X. (2008b, October). Assessing measurement invariance of the teachers' perceptions of grading practices scale across cultures. Paper presented at the Annual Meeting of the Northeastern Educational Research Association, Rocky Hill, CT.
Liu, X., O'Connell, A. A., & McCoach, D. B. (2006, April). The initial validation of teachers' perceptions of grading practices. Paper presented at the Annual Meeting of the American Educational Research Association, San Francisco, CA.
Llosa, L. (2008). Building and supporting a validity argument for a standards-based classroom assessment of English proficiency based on teacher judgments. Educational Measurement: Issues and Practice, 27(3),
educational research (3rd ed.) (pp. 783–791). New York, NY: Macmillan.
Smith, J. K. (2003). Reconsidering reliability in classroom assessment and grading. Educational Measurement: Issues and Practice, 22(4), 26–33. doi:10.1111/j.1745-3992.2003.tb00141.x
Smith, J. K., & Smith, L. F. (2009). The impact of framing effect on student preferences for university grading systems. Studies in Educational Evaluation, 35, 160–167. doi:10.1016/j.stueduc.2009.11.001
Snow, R. E. (1989). Toward assessment of cognitive and conative structures in learning. Educational Researcher, 18(9), 8–14. doi:10.3102/0013189x018009008
Sobel, F. S. (1936). Teachers' marks and objective tests as indices of adjustment. Teachers College Record, 38, 239–240.
Spooren, P., Brockx, B., & Mortelmans, D. (2013). On the validity of student evaluation of teaching: The state of the art. Review of Educational Research, 83, 598–642. doi:10.3102/0034654313496870
Stanley, G., & Baines, L. (2004). No more shopping for grades at B-Mart: Re-establishing grades as indicators of academic performance. The Clearing House: A Journal of Educational Strategies, Issues and Ideas, 77, 101–104. doi:10.1080/00098650409601237
Starch, D. (1913). Reliability and distribution of grades. Science, 38, 630–636. doi:10.1126/science.38.983.630
Starch, D. (1915). Can the variability of marks be reduced? School and Society, 2, 242–243.
Starch, D., & Elliott, E. C. (1912). Reliability of the grading of high-school work in English. School Review, 20, 442–457.
Starch, D., & Elliott, E. C. (1913a). Reliability of grading work in mathematics. School Review, 21, 254–259.
Starch, D., & Elliott, E. C. (1913b). Reliability of grading work in history. School Review, 21, 676–681.
Sun, Y., & Cheng, L. (2013). Teachers' grading practices: Meaning and values assigned. Assessment in Education: Principles, Policy & Practice, 21, 326–343. doi:10.1080/0969594.2013.768207
Svennberg, L., Meckbach, J., & Redelius, K. (2014). Exploring PE teachers' 'gut feelings': An attempt to verbalise and discuss teachers' internalised grading criteria. European Physical Education Review, 20, 199–214. doi:10.1177/1356336X13517437
Swan, G. M., Guskey, T. R., & Jung, L. A. (2014). Parents' and teachers' perceptions of standards-based and traditional report cards. Educational Assessment, Evaluation and Accountability, 26, 289–299. doi:10.1007/s11092-014-9191-4
Swineford, F. (1947). Examination of the purported unreliability of teachers' marks. The Elementary School Journal, 47, 516–521. doi:10.2307/3203007
Thorsen, C. (2014). Dimensions of norm-referenced compulsory school grades and their relative importance for the prediction of upper secondary school grades. Scandinavian Journal of Educational Research, 58, 127–146. doi:10.1080/00313831.2012.705322
Thorsen, C., & Cliffordson, C. (2012). Teachers' grade assignment and the predictive validity of criterion-referenced grades. Educational Research and Evaluation, 18, 153–172. doi:10.1080/13803611.2012.659929
Tierney, R. D., Simon, M., & Charland, J. (2011). Being fair: Teachers' interpretations of principles for standards-based grading. The Educational Forum, 75, 210–227. doi:10.1080/00131725.2011.577669
Troob, C. (1985). Longitudinal study of students entering high school in 1979: The relationship between first term performance and school completion. New York, NY: New York City Board of Education.
Troug, A. J., & Friedman, S. J. (1996). Evaluating high school teachers' written grading policies from a measurement perspective. Paper presented at the annual meeting of the National Council on Measurement in Education, New York, NY.
Unzicker, S. P. (1925). Teachers' marks and intelligence. The Journal of Educational Research, 11, 123–131. doi:10.1080/00220671.1925.10879537
U.S. Department of Education. (1999). What happens in classrooms? Instructional practices in elementary and secondary schools, 1994–95 (NCES 1999–348, by R. R. Henke, X. Chen, G. Goldman, M. Rollefson, & K. Gruber). Washington, DC: Author. Retrieved from http://nces.ed.gov/pubs99/1999348.pdf
Voss, H. L., Wendling, A., & Elliott, D. S. (1966). Some types of high school dropouts. The Journal of Educational Research, 59, 363–368.
Webster, K. L. (2011). High school grading practices: Teacher leaders' reflections, insights, and recommendations (Doctoral dissertation). Retrieved from ProQuest. (3498925)
Welsh, M. E., & D'Agostino, J. (2009). Fostering consistency between standards-based grades and large-scale assessment results. In T. R. Guskey (Ed.), Practical solutions for serious problems in standards-based grading (pp. 75–104). Thousand Oaks, CA: Corwin.
Welsh, M. E., D'Agostino, J. V., & Kaniskan, R. (2013). Grading as a reform effort: Do standards-based grades converge with test scores? Educational Measurement: Issues and Practice, 32(2), 26–36. doi:10.1111/emip.12009
Wiggins, G. (1994). Toward better report cards. Educational Leadership, 52(2), 28–37. Retrieved from http://www.ascd.org/publications/educational-leadership/oct94/vol52/num02/Toward-Better-Report-Cards.aspx
Wiley, C. R. (2011). Profiles of teacher grading practices: Integrating teacher beliefs, course criteria, and