Designing Assessment of Performance in Mathematics
Summary
The effective implementation of intended curricula that emphasise problem solving
processes requires high-stakes tests that will recognize and reward these aspects of
performance across a range of contexts and content. In this paper we discuss the
challenge of designing such tests, a set of principles for doing so well, and strategies and
tactics for turning those principles into tasks and tests that will work well in practice.
While the context is England, the issues raised have wider relevance.
1. Introduction
Everyone concerned with education recognizes the importance of assessment – but most
would like to minimise its role. Parents accept tests as necessary hurdles on the route
to valuable qualifications for their children – but are concerned at the pressure and the
consequences of failure. Many teachers accept the importance of tests – but also feel
threatened and, believing that they “know” their students’ capabilities, see the tests as a
disruption of teaching and learning. Politicians see tests as the key to accountability and
the way to prove the success of their initiatives – but want to minimize the cost and the
pressures on them that complex tests generate, through appeals against scoring, for
example. All would like the tests to be “fair”, objective, and easy to understand. These
very different motivations have led to high-stakes tests of Mathematics in England that
assess only some elements of mathematical performance, mainly concepts and skills
tested separately.
The revised national curriculum for the age range 11-16 takes a much broader, more holistic
view of performance in mathematics, in line with high international standards. It focuses on
developing the “Key Concepts” and “Key Processes” below across familiar content areas:
Number and Algebra; Geometry and Measures; Statistics.
Key Concepts:
• Competence
• Creativity
• Applications and implications of mathematics
• Critical understanding

Key Processes:
• Representing
• Analysing (reasoning)
• Analysing (procedures)
• Interpreting and evaluating
• Communicating and reflecting
1. A revised version of this draft paper will be submitted to Educational Designer, the forthcoming e-journal of ISDDE. We appreciate the opportunity to discuss some of the issues at ISDDE08.
These reflect the well-established analysis of the components of performance in mathematics itself (see e.g. Schoenfeld 1985, 2007; Blum et al. 2007; Burkhardt with Pollak 2006; Burkhardt and Bell 2007):
• knowledge and skills;
• strategies and tactics;
• metacognition;
• attitude.
This paper is concerned with how assessment, particularly high-stakes tests, can be
aligned with these performance goals.
Here we shall focus mainly on the core design challenge – the tasks that enable students
to show what they know, understand and can do (Cockcroft 1982) and the scoring
schemes that assign credit for the various aspects of performance. Tasks may be
packaged into tests in many ways – a largely separate, sometimes contentious issue; we
shall also outline a process for building balanced tests from tasks. Finally, we shall
comment on the process of implementing improved assessment, turning principles into
practice in a way that the system can absorb, without undermining the always-good
intentions.
Our approach, as ever, is that of engineering research in education – “the design and
development of tools and processes that are new or substantially improved” (RAE 2001,
2006). It reflects several decades of experience working with examining bodies,
nationally and internationally.
2. An example may help show why balance across the performance goals is crucial. If, for reasons of economy and simplicity, it were decided to assess the decathlon on the basis of the 100-metre race alone, it would surely distort decathletes’ training programmes. This has happened in Mathematics, where process aspects of performance are not currently assessed – or taught.
The importance of 'alignment' between assessment and the curriculum goals is widely
recognised internationally, particularly where narrow tests are undermining the
achievement of those goals. In its Curriculum and Evaluation Standards, the US National
Council of Teachers of Mathematics stressed that “assessment practice should mirror the
curriculum we want to develop; its goals, objectives, content and the desired
instructional approaches”, adding
"An assessment instrument that contains many computational items and
relatively few problem-solving questions, for example, is poorly aligned with a
curriculum that stresses problem solving and reasoning. Similarly, an assessment
instrument highly aligned with a curriculum that emphasises the integration of
mathematical knowledge must contain tasks that require such integration. And,
for a curriculum that stresses mathematical power, assessment must contain
tasks with non-unique solutions." (NCTM 1989, pp. 194-195)
These points are emphasized again in the 2000 revision, Principles and Standards for School Mathematics (NCTM 2000).
To summarise in rather more technical terms, the implemented (or enacted) curriculum
will inevitably be close to the tested curriculum. If you wish to implement the intended
curriculum, the tests must cover its goals in a balanced way. Ignoring Roles B or C
undermines policy decisions; accepting their inevitability has profound implications for
the design of high-stakes tests.
This can be an opportunity rather than, as at present, a problem. Both informal
observation (e.g. with well-engineered course work) and research (e.g. Barnes, Clarke
and Stephens 2000) have shown that well-designed assessment can be a uniquely
powerful lever for driving large-scale improvement.
3. Mathematical modelling, a useful term, is not yet in everyday use in school mathematics in the UK. Indeed, “modelling” is often thought of as a rather advanced and sophisticated process, used only by professionals. That is far from the truth. We have all been modelling for a long time.
[Figure: a typical structured examination item – a triangle in which students are asked to “find x”, led through the solution by parts (a) and (b).]
If a 16-year-old student cannot find x without being led through the task by (a) and (b),
is this worthwhile performance in mathematics? Will this fragmentary skill equip the
student for subsequent work, or for life? For the student who can do the task without
the aid of (a) and (b), this already-simple problem is further trivialized by fragmentation.
This fragmentation arose as the inevitable, but unrecognized, consequence of adopting the model that separate elements of performance have levels that reflect their difficulty, as specified in the National Curriculum. Since, in fact, the difficulty of a complex task is not simply that of its parts, the model could only be sustained by testing the parts separately. This is a travesty of performance in mathematics. (If English were assessed in an equivalent way, it would test only spelling and grammar through short items, with no essays or other substantial writing.)
It is now accepted that the difficulty of a task depends on the total cognitive load, which
is determined by the interaction of various aspects of the task including its:
• Complexity
• Unfamiliarity
• Technical demand
as reflected in the chains of reasoning needed for developing a solution. Assessing the
level of a student’s performance on a task has to take all these into account. (This is
actually a return to earlier practice in assessing mathematics.)
What does this mean for task design? Let us start by looking at the length of the
expected chain of independent reasoning. Compare the triangle task above with Consecutive Sums:

Consecutive Sums
Some numbers equal the sum of consecutive natural numbers:
5 = 2 + 3
9 = 4 + 5 = 2 + 3 + 4
Find all you can about the properties of such “Consecutive Sums”.
Or compare it with this tournament-planning task:

Plan how to organise the league, so that the tournament will take the shortest possible time. Put all the information on a poster so that the players can easily understand what to do.
Are such problems inevitably more difficult than short items that, in principle, only
demand recall? Empirically, the answer is no, or rather, not if they are well-engineered.
How can this be? There are a number of factors. When the strategic demand, working
out what to try and what mathematical tools to use, is high, the initial technical demand
must be lower – in Consecutive Sums a lot can be done with simple arithmetic, though
some algebra and/or geometry are needed for a fuller analysis. Secondly, it is often
easier to tackle a complete problem than parts that have less meaning (Thurston 1999).
Finally, it is clear to the student that thinking, connecting with their whole knowledge
base, is needed – a very different mode from the usual attempted recall. These are not
unnecessary complications, as some in mathematics assessment might regard them,
but central elements in the performance goals.
The difficulty of a given task can only be reliably determined by trialling with
appropriately prepared students – the usual way well-engineered products are
developed. If needed, in the light of the designer’s insight and feedback from trials,
scaffolding can be added to give students easier access, and a well-engineered ramp of
difficulty. We discuss such design tactics in Section 5 below.
[Figure: a coin-tossing board game that students are asked to evaluate, not reproduced in full. It asks: “Suppose you start by tossing a head, then a tail, then a head. Where is your counter now?” and “List and describe all the faults you notice with the board.”]

Design a Tent

[The task statement is not reproduced: students are asked to design a tent, estimating the dimensions adults would need and working out the shape and size of the material required.]
In this open format, responses proved difficult to assess, mainly because the task is too
ambiguous. Some students design the tent from many pieces of material, while others
use a single piece. Some give unrealistic measurements and it is often impossible to say
whether this is because they cannot estimate the dimensions of an adult, because they
cannot transfer measurements or because they cannot calculate accurately. Some make
assumptions that extra space is needed for baggage, but do not explicitly state this.
Some use trigonometry, others use Pythagoras' theorem, while others use scale
drawing.
With open tasks like this, students often interpret the task in different ways, make
different assumptions, and use different mathematical techniques. In fact, they are
essentially engaged in different tasks. What is more, it is not always possible to infer
their interpretations, assumptions and abilities from the written responses. If students
have not used Pythagoras' theorem, for example, we cannot tell if it is because they are
unable to use it, or simply have chosen not to. This argument may also be applied to
mathematical processes. How can we assess whether or not a student can generalise a
pattern or validate a solution unless we ask them to?
One solution to the transparency/openness issue is to define clearly the specific assessment purposes of a package of items for the students, making clear what will be assessed and valued 4.
To achieve rich and robust assessment, tasks are tried and revised many times,
exploring alternative degrees of scaffolding. Another interesting example is given by
Shannon and Zawojewski (1995), where trials of two versions of the same task, Supermarket Carts, are reported. The first, shown in Figure 4, was scaffolded by a series
of questions gently ramped in order of difficulty, starting from specific examples to a
final, generalised 'challenge'.
4. Initially, these expectations need to be stated explicitly, either in the task or for the package as a whole. Over time, they come to be understood and absorbed.
Figure 4.
Supermarket Carts
The diagrams below show 12 supermarket carts that have been "nested" together. They also show that the length of a single supermarket cart is 96 cm and that each cart sticks out 30 cm beyond the previous one in line.
The second version began with a statement of the generalised problem, essentially the
last two parts in Figure 4. This study found, as one would expect, that students
struggled more with the less structured task and fewer were able to arrive at the general
solution. What was perhaps more interesting was that the students perceived the
purposes of the tasks as qualitatively different. The students saw the structured task as
assessing content related to equations or functions, while they saw the unstructured task as assessing how they would develop an approach to a problem. Students had no suggestions as to how the structured task could be improved, but they had many suggestions as to how the unstructured task could be made to give clearer guidance. Although they could identify the distinct purposes behind the tasks, they attributed their difficulties to poor task design rather than to their own lack of experience in tackling unstructured problems. This result, and many studies of the teaching of modelling skills, emphasise the importance of the alignment between standards, curriculum and assessment, noted earlier – and the associated need for adequately supportive professional development to enable teachers to meet the new challenges involved. The implications of this for implementation are sketched in Section 8.
“Differentiation”
An examination must allow all the students who take it to (in Cockcroft’s immortal
phrase) show what they know, understand and can do, without wasting much
examination time in ‘failure activity’. Influenced by the technical limitations of many students in mathematics and the dominance of technical demand in mathematics examinations, the Cockcroft committee decided that this principle required differentiation by task and thus
tiers. This was strongly opposed by other subjects where differentiation by outcome
only is standard practice – essay questions allow students to respond at their own, very
different, levels and scorers to score them appropriately.
It is important to reconsider this issue for broad-spectrum assessment in mathematics of
the kind discussed here. More open tasks allow responses at a wider range of levels.
There are design strategies that can help too.
The “exponential ramp” is a powerful design technique for assessing a wide range of
levels of performance of students. It uses rich substantial tasks that are scaffolded to
offer opportunities at different levels, increasing the challenge in later parts of the task.
Consecutive Sums
Some numbers equal the sum of consecutive natural numbers:
5 = 2 + 3
9 = 4 + 5 = 2 + 3 + 4
• Find a property of sums of two consecutive natural numbers.
• Find a property of sums of three consecutive natural numbers.
• Find a property of sums of n consecutive natural numbers.
• Which numbers are not “consecutive sums”?
In each case, explain why your results are true.
While this is difficult in technical exercises, it can be done in more open rich tasks. This
scaffolded version of Consecutive Sums gives students easier access, with a well-
engineered ramp of difficulty. Nearly every student can solve the first part; the proof in
the last part is challenging for most people.
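For reference, here is a sketch (our illustration, not part of the task) of the mathematics the ramp leads to. The sum of k consecutive natural numbers starting at a + 1 is

\[ (a+1) + (a+2) + \cdots + (a+k) \;=\; ka + \frac{k(k+1)}{2} \;=\; \frac{k(2a+k+1)}{2}. \]

For k = 2 this is 2a + 3, always odd; for k = 3 it is 3(a + 2), always a multiple of 3. For any k ≥ 2, exactly one of k and 2a + k + 1 is odd, and that odd factor is at least 3, so every “consecutive sum” has an odd factor greater than 1. Conversely, any number with an odd factor m > 1 is the sum of the m consecutive integers centred on N/m, from which any non-positive terms cancel. Hence the numbers that are not consecutive sums are exactly the powers of 2 – the proof that most people find challenging.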
However, there are always losses as well as gains from scaffolding – here it means that
students only have to answer questions, not to pose them – the latter is an important
part of thinking with mathematics.
This and other ways of achieving differentiation by outcome on the same task set can
obviate or reduce the need for ‘tiers’, where different students are given more or less
challenging tasks according to their perceived level of performance. In the US, tiers are unacceptable for valid social reasons – essentially the evidence that potentially high-achieving students from less-advantaged backgrounds are put in lower sets with lower expectations. This concern is not given the same priority in the UK, where it is argued that students should be given tasks that enable them to show what they can do – not what they cannot. This is an important and interesting dilemma.
Approaches to scoring
All assessment involves value judgments. The choice of task types defines the range of
performances that are valued. Scoring schemes define how far the various elements of
performance on a task are valued. Thus scoring, aggregating points and reporting of
achievement are major issues in assessment design. Here we shall look at scoring from
a broader perspective than is common in UK mathematics assessment.
First we note that the value system is often distorted by the perceived constraints of
practicality. Scoring schemes, instead of apportioning credit according to the importance
of the elements of performance in the task, assign points to elements that are easy to
identify – answers rather than explanations 5, for example. Tasks are chosen because they are “easy to score”, and eliminated if scoring may involve judgment. While any high-
stakes assessment system must work smoothly in practice, experience in other subjects
suggests that many of the constraints that are accepted for Mathematics are
unnecessary. We discuss these further in Section 7.
Figure 5
Magazine Cover
This pattern is to appear on the front cover of the school magazine.
[The pattern – a triangle with equal sides, point down, touching a circle, in contrasting colours – is not reproduced here.]
You need to call the magazine editor and describe the pattern as clearly as possible in words so
that she can draw it.
Write down what you will say on the phone.
5. Some GCSE scoring schemes give full points for correct answers, even when working or explanation is asked for – the latter is merely evidence for partial credit if the answers are wrong. This ignores the value of explanation as evidence of mathematical understanding.
Figure 5 shows Magazine Cover – designed for ages 8-10, it challenges older students.
The scoring scheme uses a fairly standard point system. The total points available are
chosen to be equal to the length of time (in minutes) it takes a typical successful student
to complete the task. This arbitrary choice reflects the need for precision without
overloading examiners’ judgments with too much detail. The total points for each task
are then distributed among the different aspects of performance, so that each aspect is
given a weight appropriate to its importance.
This process is illustrated in the rubric design for Magazine Cover, which shows the
various elements that are credited.
Magazine Cover
The core elements of performance required by this task are:
• describe a given geometric pattern

Points
A circle. 1
A triangle. 1
Triangle has equal sides. 1
Triangle has its point down. 1
Triangle touches the circle. 1
Colour contrast described. 1
Some size information given. 1
(Maximum of 6 of the 7 points above.)
Total Points 6
The maximum score is 6 because this is a 6-minute task. For each of the two simple
technical terms, circle and triangle, 1 point is assigned. (Most students get these points.)
For each of the key geometric insights on the triangle (equal sides, point down, touching
the circle), 1 point is given, but further technical terms (e.g., equilateral) are not
required at grade 3. For the color contrast, seen as a key feature in the context of the
task (cover design), 1 point is given, and 1 point is given for some size information. To
keep within the total, a maximum of 6 of the 7 points above may be awarded.
In more complex tasks, more than one point may be assigned for an important and
substantial part of the task, such as for an explanation. The rubric will give guidance on
assigning partial credit for a correct but incomplete explanation.
There are value judgments throughout this and every other rubric. Here, for example,
some mathematics teachers criticise the point for colour – “not mathematics”, but
important in the context of the problem. Others want some credit for the clarity of the communication 6.
6. German mathematics teachers demand that explanations are in good German.
Figure 6.
Design a Tent: “pointwise by category” scoring scheme (the Qu 3 section)

Qu 3: Follow through students’ answers from Qu 2. Each of the following features earns 1 point, in some cases credited under both a content category and the process category:
• Correct number of tent faces shown
• Faces joined together correctly
• Right angle formed by vertical zip and ground shown correctly
• Length L consistent with base (Qu 2)
• End flaps consistent with base (W/2 and H)
• Appropriate method selected to calculate S
• Method applied to calculate S correctly
• Correct answer obtained for S
• Appropriate method selected to calculate angle a or angle b
• Method applied to calculate a or b correctly
• Correct answer obtained for angle a or b
• Correct answer obtained for second angle

[Diagram: the tent, labelled with length L, width W (and W/2), height H, slant length S and angles a, b.]

Totals for the task: Problem solving and reasoning 12; Number and Quantity 3; Geometry 15.
Profile scoring There are some advantages in scoring and reporting different aspects
of student performance separately – Concepts, Skills and Problem solving form a popular
trio in the US. This allows teachers to see areas of strength and weakness of individual
students, and of their own teaching. It allows school systems to look at progress more
deeply, particularly as new goals (like problem solving) are introduced into the system.
Data on the effect of the new broader US teaching materials on performance show
roughly the same standards in mathematical skills but real gains in problem solving.
Let us look at the comparison for the Design a Tent task. Figure 6 contains a sample
"pointwise by category" scoring scheme (Balanced Assessment 1995). This assigns
points to separate categories, which are aggregated separately to give a student profile.
Here, features of the response are credited under three categories: Problem solving and reasoning (PSR), 12 points; Number and Quantity (N&Q), 3; Geometry (G), 15. On some
occasions, where there is a big strategic demand within the task, points will be given for
both content and process categories. This does not amount to double weighting as the
points in different categories are not added together but reported separately as a profile.
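To make the bookkeeping concrete, here is a minimal sketch of how such a category-tagged scheme can be represented and aggregated into a profile. It is our illustration, not the Balanced Assessment implementation; the three criteria and their category assignments are assumed for the example.

    from collections import defaultdict

    # Hypothetical extract of a category-tagged scheme: each criterion carries
    # points under one or more categories (PSR = Problem solving and reasoning,
    # NQ = Number and Quantity, G = Geometry).
    SCHEME = [
        ("Appropriate method selected to calculate S", {"PSR": 1, "G": 1}),
        ("Method applied to calculate S correctly", {"G": 1}),
        ("Correct answer obtained for S", {"NQ": 1}),
    ]

    def profile(awarded_criteria):
        """Aggregate the credited criteria into a per-category profile.
        Categories are reported separately, never summed into one total."""
        totals = defaultdict(int)
        for criterion, credit in SCHEME:
            if criterion in awarded_criteria:
                for category, points in credit.items():
                    totals[category] += points
        return dict(totals)

    # A response that selects and applies a sound method for S but slips
    # on the final arithmetic:
    print(profile({
        "Appropriate method selected to calculate S",
        "Method applied to calculate S correctly",
    }))  # -> {'PSR': 1, 'G': 2}

Because the PSR point and the G point for the same criterion sit in different components of the profile, crediting both does not double-weight that criterion in any single reported score.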
Holistic scoring involves considering each response as a single entity and comparing its quality with ‘benchmark’ descriptors and sample responses. Such a scheme may be devised empirically, by categorising actual responses, then writing descriptions which aim to capture the essential features that are common to categories. Figure 7 offers a sample holistic scheme for the same task and Figure 8 gives a sample piece of work scored according to both schemes.
The advantage of this method is that the score assigned is focused on the overall
quality of the response, which is not always captured by point scoring. The disadvantage
is that the scorer has to internalise more information and the judgments are broader and
more subjective. Though holistic scoring has been used for high-stakes assessment in
Mathematics (New Standards 1995), and is common in other subject areas, its main
strength in UK Mathematics is probably for formative assessment, where a generic
(i.e. not task specific) scoring scheme can cover any tasks teachers or students choose.
7. In the design of Key Stage Tests and GCSE in Mathematics, designers face specific constraints: “Design tasks with 20% of the points on each Attainment Target, and with 30% at level 4, 40% at level 5, 30% at level 6”. Both the products and experience show that such constraints inhibit designers and lead inevitably to poorer tasks, notably short items that are a travesty of performance in mathematics.
Figure 9. A Framework for Balance
(MARS 1998, adapted for QCA Programme of Study)
[The framework itself is not reproduced in this draft.]
Time
Myth 1: Testing takes too much time.
Feedback is a vitally important factor in improving performance – in sport, “games” give purpose to training; in music, the purpose of practice is to play “gigs” (or concerts, for the less ’cool’). If doing the assessment tasks is also good learning, that is a further justification for spending time on assessment.
We accept that, at least in the current UK and US political climate, constraints on the
time for high-stakes assessment are inevitable 8. The issues include:
• How much time is reasonable?
• How can it best be used?
On the first, practice around the world ranges widely – at least from about 40 minutes to
many hours per year. A reasoned decision would relate this to both the time for
structured classroom assessment and for teaching. The pressures for reducing the time
reflect the distaste for substantial testing that many key groups share – though for very
different reasons.
Sampling
Myth 2: Each test should cover all the important mathematics.
In mathematics people say: “We taught (or learned) X but it wasn’t on the test.”
Mathematics is the only subject where there is a tradition of “coverage”, assuming that
all aspects of new content should be assessed every time. This has been at the expense
of any significant assessment of process aspects; once the interaction between process
and content in tasks is recognized, it is clearly impossible to assess the full range of
types of performance. Is this a concern?
Sampling is accepted as the inevitable norm in all other subjects. History examinations,
year-by-year, ask for essays on different aspects of the history curriculum; the same is
true in, say, chemistry; final examinations in literature or poetry courses do not expect
students to write about every set book or poem studied. It is accepted that a given
examination should:
• Sample the domain of knowledge and performance
• Vary the sample from year to year, so that teaching addresses the whole domain
• Emphasise aspects that are of general importance, notably the process aspects
However, the balance of the sampling is crucial, as discussed in Sections 2 and 6; it answers
Myth 3: “We don’t test that but, of course, all good teachers teach it.”
8. Comprehensive evaluation in the 1980s of the “100% coursework” assessment schemes for English, with no timed tests, showed that they satisfied all reasonable requirements of reliability and fairness; however, the then Prime Minister’s decision to outlaw them reflected a widespread “common sense” perception to the contrary. Mathematics coursework has suffered from inadequate engineering.
Accuracy
Myth 4: Tests are precision instruments.
They are not, as test-producers’ fine print sometimes makes clear. Mathematics examiners have long been proud of the consistency between different scorers – the typical score-rescore variation is often just a point or two, much less than in some other subjects. However, that is not a fair measure of how accurately a test determines what a student knows, understands and can do; that must include the test-retest variation. Testing and then retesting the same student on parallel forms, “equated” to the same standard, should produce the same scores. In fact, they are likely to be substantially different. There is a reluctance to publicise test-retest variation, or even to measure it 9. Most estimates suggest that the total uncertainty is roughly one grade – on retest, a Grade C is as likely to be either a B or a D as a C. This fact is ignored by policy makers, who know that measurement uncertainty is not politically palatable, even when life-changing decisions are made on the basis of test scores.
Tests are fair, but in the same way as a lottery is fair.
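A back-of-envelope model (ours, with assumed numbers) shows why. Suppose each administration measures a student’s true score T with independent error:

\[ X_i = T + e_i, \qquad e_i \sim N(0, \sigma^2). \]

The difference between two administrations then has standard deviation \( \sqrt{2}\,\sigma \). If \( \sigma \) is around three-quarters of a grade-band width w – broadly consistent with the estimates above – then a student whose true score lies at the middle of the C band is awarded a C with probability \( P(|e_i| < w/2) \approx 0.5 \), with B and D sharing most of the remainder.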
What are the implications of this for assessment design in Mathematics? The drive for “precision” has led to narrow de facto assessment objectives and simplistic tests. This is clearly pointless – the true uncertainties remain high, and the price paid for unbalanced assessment is as unnecessary as it is harmful. Mathematics should be content with score-rescore variation comparable with that in other subjects, notably English, which command public confidence and respect. With the kinds of task described in this paper, that is readily achieved.
9. One rigorous recent study (Gardner 2002??) looked at an “eleven plus” test in Northern Ireland, a traditional test with important “consequences”. Testing the same cohort with two equivalent tests, they found that, of the half of the 6000 students who “passed” on one test, a substantial number – approaching 3000 – would have been different if the other test had been used. Pressure was placed on the authors, and their university, not to publish the results.
Development by several teams working in parallel to a common brief is the approach most likely to yield high-quality outcomes 10.
Pace of change: “big bang” v incremental When a problem is identified, there is a
political urge to solve it. When the solution implies profound changes, particularly in the
well-grooved practice of professionals, the rate at which they can change without
corruption of the intentions is an important factor. The level of support, of proven
effectiveness, is also important. If the pace of change required is too great (an empirical
question), important aspects of the planned change may simply “disappear into the
sand”. This has been a recurrent problem with “big bang” implementations. An
alternative model (Joint Matriculation Board/Shell Centre 1982-86), based on regular, well-supported incremental changes, has proved effective in implementing profound changes over a period of a few years, using aligned materials for assessment, teaching and professional development. It also proved popular with teachers – a significant factor in faithful implementation – and avoids any public upheaval in tried and trusted examinations, a crucial advantage. In the current context, this implies introducing new task types gradually over a period of a few years until the target balance is achieved.
The challenge to the examination boards The English exam boards have both public
service and commercial roles, with an expectation that they will deliver their
examinations and results impeccably. While the former makes them anxious to “do the
right thing”, they are likely to lose market share if their tests are perceived as more
difficult – a likely effect of any profound change. Thus they all have a commercial
incentive to minimise any such change.
What might be done to avoid the distortion of the outcomes that policy seeks? A number
of possibilities are worth considering. The design of the new types of task might be
assigned to other agencies. The selection of a mandated proportion of tasks from this
“bank” could be left to individual boards, or could be given to another national body.
It could be decided that all providers include in their papers the same set of novel tasks.
Common tasks of this kind would remove any question of a “race to the bottom”. It
would have the additional advantage of providing some comparative information on the
overall standards of the different boards’ examination based on a subset of tasks of high
validity. (This might allow QCA to do less micromanagement of board examinations.)
The common questions approach was considered a decade ago; while no insuperable
objections emerged, it was unpopular with the exam boards, who are proud of their
independence and omnicompetence.
10. The Bowland Trust, with DCSF support, has taken this kind of approach to the development of “case studies” in real problem solving – teaching units that closely reflect the Programme of Study and are supported by a linked professional development package. Assessment development is next.