Mathematical Competency Demands of Assessment Items
https://doi.org/10.1007/s10763-017-9870-y
Received: 29 April 2017 / Accepted: 14 November 2017 / Published online: 8 December 2017
© Ministry of Science and Technology, Taiwan 2017
* Andreas Pettersen
[email protected]

Johan Braeken
[email protected]

1 Department of Teacher Education and School Research, University of Oslo, Blindern, P.O. Box 1099, N-0317 Oslo, Norway
2 Centre for Educational Measurement, Faculty of Educational Sciences, University of Oslo, Blindern, P.O. Box 1161, N-0318 Oslo, Norway
Introduction
1 In this paper, the term mathematical competence refers to a general definition of what it means to master mathematics (e.g. the description provided by Niss and Højgaard), while the term mathematical competency (or competencies in the plural) refers to one or a set of the constituent parts of mathematical competence.
1995). Messick (1995) noted that such empirical evidence can be derived from various
sources. One such source is psychometric models that link theoretical cognitive models
with empirical measurements (Embretson & Gorin, 2001; Koeppen et al., 2008;
Messick, 1995). The modelling of item difficulty has been used as validity evidence
for many tests (see e.g. Enright, Morley & Sheehan, 2002; Gorin & Embretson, 2006),
and identifying features of test items that influence item difficulty is important to
understand what is measured in tests (De Boeck, Cho & Wilson, 2016; Graf,
Peterson, Steffen & Lawless, 2005). For instance, Embretson and Daniel (2008) scored
mathematics test items for cognitive complexity on 12 variables (e.g. number of
equations that had to be recalled and number of subgoals involved in the solution
process) and linked these variables to item performance by using an explanatory item
response model. They found that most cognitive complexity variables (such as subgoal
count) were significant predictors of item difficulty (but equation recall count was not),
and that these cognitive complexity variables could account for about half of the
variance in item difficulty. Embretson and Daniel (2008) argued that these results
supported the validity of the postulated model of cognitive complexity for mathemat-
ical problem solving.
The present study uses psychometric models to look for validity evidence by linking a
theoretical competency framework to empirical measurements of mathematical com-
petence. The study builds on the work by the PISA mathematics expert group (MEG)
who developed and studied an item analysis scheme to identify the competency
demands of mathematical problems. The development of the item analysis scheme
was based on the concept of mathematical competence that has underpinned the PISA
mathematics frameworks (Turner, Blum & Niss, 2015) and that evolved in parallel and
intertwined with the KOM framework (for details of the relationship between the
concept of competencies in the different frameworks, see Niss (2015)).
The MEG item analysis scheme consists of operational definitions of six mathemat-
ical competencies (see Table 1) which are a modified version of the competencies in the
KOM framework where the mathematical thinking competency and the reasoning
competency have been merged into Reasoning and argument, and the Aids and tools
competency has been omitted (Turner et al., 2015). The scheme also includes descrip-
tions of four different levels of demand for each competency ranging from 0 (lowest
demand) to 3 (highest demand) (for the full item analysis scheme, see Turner, Blum and
Niss (2015)). The results from regression analysis of PISA 2003 and PISA 2006 data
showed that the rated levels of competency demands could account for a considerable
amount of the variance in item difficulty (Turner, Dossey, Blum & Niss, 2013). We
expand on this initial study in three ways:
Whereas the original MEG study focused solely on 48 common items in the PISA 2003
and 2006 mathematics surveys, two different assessments were used in this study: the
PISA 2012 mathematics survey (84 items administered in Norway) and the Norwegian
2014 national mathematics exam (56 items). Both assessments have been developed to
Table 1 Definitions of the six mathematical competencies in the item analysis scheme (Turner et al., 2015)
Communication. Reading and interpreting statements, questions, instructions, tasks, images and objects;
imagining and understanding the situation presented and making sense of the information provided
including the mathematical terms referred to; presenting and explaining one’s mathematical work or
reasoning.
Devising strategies. Selecting or devising a mathematical strategy to solve a problem as well as monitoring and
controlling implementation of the strategy.
Mathematising. Translating an extra-mathematical situation into a mathematical model, interpreting outcomes
from using a model in relation to the problem situation or validating the adequacy of the model in relation to
the problem situation.
Representation. Decoding, translating between and making use of given mathematical representations in
pursuit of a solution; selecting or devising representations to capture the situation or to present one’s work.
Symbols and formalism. Understanding and implementing mathematical procedures and language (including
symbolic expressions, arithmetic and algebraic operations), using the mathematical conventions and rules
that govern them; activating and using knowledge of definitions, results, rules and formal systems.
Reasoning and argument. Drawing inferences by using logically rooted thought processes that explore and
connect problem elements to form, scrutinise or justify arguments and conclusions.
In the original MEG study, the PISA MEG which had developed the item analysis
scheme rated the mathematical competency demands of the items. The MEG’s famil-
iarity with the PISA items and their empirical difficulty could potentially bias their
ratings of competency demands and, in turn, the study results. To verify whether the
results are valid even with a new set of raters who are unfamiliar with the items, we
used a group of five mathematics teachers and prospective mathematics teachers (subsequently
called teachers) to establish the competency demands of the items (for further infor-
mation about this group and the rating procedure, see Pettersen and Nortvedt (2017)).
In the original MEG study, estimates of the item difficulty were first computed, and
then, these estimates were treated as true difficulties and used as outcome variables in a
linear regression analysis. This two-step approach ignores the estimation uncertainty
and makes limited use of the actual item response data with an effective sample size of
48, the number of items. Therefore, this approach risks making some inferences
unreliable. An explanatory item response model is a psychometric model that allows
the cognitive demands of items to be treated as item attributes (i.e. explanatory factors)
that are directly related to the success rates on the items (De Boeck et al., 2016). The
psychometric model is similar in spirit to the two-step approach from the MEG study;
however, it integrates all steps into one and effectively uses all available item response
data (all responses of each student on each item). By building on and extending
Turner et al.'s (2013) study, the present study further explored how the rated mathe-
matical competency demands in assessment items are related to the empirical difficulty
of items. Specifically, the following core research question was addressed: To what
extent do differences in teacher-rated demands of the six MEG competencies in
mathematics assessment items align with the differences in the empirical item
difficulty?
Method
Figure 1 shows the research design of the present study. Pettersen and Nortvedt (2017)
presented the teachers’ ratings of the mathematical competency demands of the items
(Fig. 1, left). The current study included student responses to the items, and it applied
an explanatory item response modelling approach to investigate the relationship be-
tween the differences in the teacher-rated competency demands of the items (ΔX) and
the differences in empirical item difficulty (Δβ). Details of each assessment are given
below. Then, we outline a procedure to establish the teacher-rated competency demands
of the items as well as a statistical analysis approach to study the relation between
competency demands and empirical item difficulties.
Fig. 1 Research design of the current study: the teachers' ratings (X) of the items' competency demands and the students' item responses (Y) are used to relate differences in rated competency demands (ΔX) to differences in empirical item difficulty (Δβ)
PISA 2012
The PISA 2012 assessment aimed to measure 15-year-old students’ (i.e. students born
in 1996) mathematical literacy as defined in the PISA 2012 mathematics assessment
framework (OECD, 2013). According to this framework, the six mathematical compe-
tencies in the MEG item analysis scheme underpin mathematical literacy in practice
(OECD, 2013). The PISA 2012 paper-based mathematics assessment was administered
to a representative sample of approximately 4700 Norwegian students (OECD, 2014),
and it consisted of a total of 84 items. Note that PISA is a system-level assessment, and
it strategically distributes items across students by using a rotating booklet design to
limit the testing time and effort (OECD, 2014). In Norway, every PISA mathematics
item was assigned to between 1398 and 1452 students. In the original coding, student
responses were dichotomously coded as full credit (1) and no credit (0) for 76 of the
items, and eight items used full (2), partial (1) and no (0) credit codes. For ease of
comparison across items, we recoded partial credits to no credit such that all student
responses were dichotomously scored.
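As a minimal illustration of this recoding step (not taken from the study's scripts), assuming the responses are held in a long-format R data frame resp with columns item and score, where score is the original credit code:

# Treat only the maximum (full-credit) code of each item as correct;
# partial credit is collapsed into no credit.
max_by_item <- ave(resp$score, resp$item, FUN = function(s) max(s, na.rm = TRUE))
resp$y <- as.integer(resp$score == max_by_item)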
Statistical Analysis
A similar analysis procedure was followed for both assessments. First, descriptive
statistics on the teacher-rated competency demands across items were given. Second,
an explanatory item response modelling approach was used to study the relationship
between the rated competency demands of the items and their empirical difficulty. A
range of item response models with different item predictors was fitted using the lme4
package (Bates, Mächler, Bolker & Walker, 2015) in the open-source statistical soft-
ware R (R Core Team, 2016) to investigate how the rated competency demands could
explain item difficulties.
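As a sketch of how such models can be fitted in lme4 with crossed random effects for students and items, in the spirit of the explanatory item response models described by De Boeck et al. (2016); the data frame d and its column names are illustrative assumptions, not the study's actual scripts:

library(lme4)

# d: long-format data with one row per person-item response and columns
# y (0/1 score), person, item, and X1..X6 (the averaged teacher ratings of the
# six competency demands, merged onto every response to that item).

# Null model: random person abilities and random item difficulties.
m0 <- glmer(y ~ 1 + (1 | person) + (1 | item),
            data = d, family = binomial("logit"))

# Single-predictor model, e.g. with the first competency demand only.
m_X1 <- update(m0, . ~ . + X1)

# Full model: all six rated competency demands as item predictors; the
# remaining random item intercept plays the role of the residual term.
# Note: glmer models the log-odds of success, so these coefficients carry the
# opposite sign of effects reported on the item-difficulty scale.
m_full <- update(m0, . ~ . + X1 + X2 + X3 + X4 + X5 + X6)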
In an item response model such as the one-parameter logistic item response model, the
probability of a person p correctly responding to an item i (Ypi = 1) is modelled as a
function of the ability of the person (θp) and the difficulty of the item (βi):
$$\Pr(Y_{pi} = 1 \mid \theta_p, \beta_i) = \frac{\exp(\theta_p - \beta_i)}{1 + \exp(\theta_p - \beta_i)}.$$
The person ability is estimated based on her or his performance on the test, while the
item difficulty is estimated based on the performances of all persons on that item. The
person abilities and item difficulties are placed on the same scale, and as can be seen
from the function above, a higher person ability (θp) relative to the item difficulty (βi)
(either due to a more able person, an easier item or both) leads to a higher chance of a
correct response. For persons with ability equal to the item difficulty (θp=βi), the chance
of a correct response on that item is 50%. The explanatory extension used in this study
can be seen as adding an additional regression layer to the model,
$\beta_i = \sum_{k=1}^{K} X_{ik} b_k + \varepsilon_i$ (Janssen, Schepers & Peres, 2004).
This extension allows the item difficulty to be predicted based on K item predictors,
such as the rated demands of the six mathematical competencies X1 to X6:

$$\Pr(Y_{pi} = 1 \mid \theta_p, \beta_i) = \frac{\exp\!\left(\theta_p - \left(\sum_{k=1}^{K} X_{ik} b_k + \varepsilon_i\right)\right)}{1 + \exp\!\left(\theta_p - \left(\sum_{k=1}^{K} X_{ik} b_k + \varepsilon_i\right)\right)}.$$
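For concreteness, the response function above can be written as a one-line R helper (the function name is ours):

# Probability of a correct response under the one-parameter logistic model.
p_correct <- function(theta, beta) plogis(theta - beta)
p_correct(theta = 0, beta = 0)     # 0.50: ability equal to item difficulty
p_correct(theta = 1, beta = -0.5)  # abler person, easier item: higher chance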
Missing Data
For both assessments, missing responses were coded as 0 if students had attempted to
solve at least one of the subsequent items. Consecutive missing responses clustered at
the end of the tests were regarded as not reached and subsequently omitted, except for
the first value in the missing series, which was coded as 0. This treatment of missing
data is common in large-scale assessments (e.g. in PISA, OECD, 2014), and it is
performed to avoid inflating the item difficulty due to confounding by the time
constraints of the test and working speed of students. For the Norwegian exam, the
missing data in the two parts were treated separately, as they each had an individual
time frame.
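A sketch of this missing-data rule, assuming a person-by-item score matrix scores with items in administration order and NA marking omitted responses (object and function names are ours):

# Embedded missing responses are scored 0; a trailing run of missing responses
# is treated as not reached and omitted (NA), except its first value (scored 0).
treat_missing <- function(responses) {
  miss <- is.na(responses)
  n <- length(responses)
  last_attempted <- if (any(!miss)) max(which(!miss)) else 0
  out <- responses
  if (last_attempted > 0) {
    idx <- seq_len(last_attempted)
    out[idx][miss[idx]] <- 0
  }
  if (last_attempted < n) out[last_attempted + 1] <- 0
  out
}
scores_treated <- t(apply(scores, 1, treat_missing))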
Results
PISA 2012
Table 3 shows the descriptive statistics on the five teachers’ ratings of the mathematical
competency demands of the PISA items. The intraclass correlation coefficients (ICC)
indicated that the teachers had a rather high agreement in their ratings of each of the six
competency demands. The means of the rated competency demands were between 0.56
and 1.22 for the PISA items. The distributions of the averaged teacher ratings are
displayed in Fig. 4 and show that Communication and Reasoning and argument
were the highest-rated competencies in the PISA items and that most ratings for all
competencies were located at lower levels (levels 0 and 1). The ratings of competency
demands tended to be moderately positively intercorrelated (median r = .39), indicat-
ing some overlap in the teacher-rated competency demands across the different com-
petencies (see Table 3). Although the correlation between Communication and
Devising strategies was somewhat lower (r = .19), the real exception to the rule was
Representation, the ratings for which were much less correlated with the ratings for the
five other competencies (median r = − .08).
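For reference, agreement and correlation statistics of this kind can be obtained roughly as follows, assuming an items × raters matrix per competency (e.g. ratings_comm for Communication) and an items × 6 matrix avg_ratings of ratings averaged over the five raters; the object names and the choice of ICC variant are ours:

library(psych)

# Inter-rater agreement for one competency demand (rows = items, columns = raters).
ICC(ratings_comm)

# Inter-correlations between the averaged rated demands of the six competencies,
# and their median off-diagonal value.
r <- cor(avg_ratings)
round(r, 2)
median(r[lower.tri(r)])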
Null Model The average 15-year-old Norwegian student has a 42% chance of correctly
responding to an average PISA 2012 mathematics item. It was found that 21 and 35%
of the variation in responses was attributable to individual student differences and
individual item differences, respectively. Therefore, although there is quite some
variation in student ability (σ²θ = 1.61), there is even more variation in item
difficulty (σ²β = 2.68) for PISA 2012; we hypothesised these differences in item
difficulty to be linked to differences in rated mathematical competency demands.
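These quantities follow directly from the null model fitted above (m0 in the earlier sketch); on the logit scale, the residual variance of the logistic distribution is π²/3:

vc <- as.data.frame(VarCorr(m0))
var_person <- vc$vcov[vc$grp == "person"]
var_item   <- vc$vcov[vc$grp == "item"]
total      <- var_person + var_item + pi^2 / 3

# Shares of response variation due to students and items (the .21 and .35 reported above).
c(person = var_person / total, item = var_item / total)

# Chance of an average student solving an average item (the 42% reported above).
plogis(fixef(m0)[["(Intercept)"]])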
Full Model In the second stage, we assessed the impact of the mathematical compe-
tency demands in their full context by fitting an explanatory item response model with
the rated demands for all six competencies as item predictors to explain the variation in
item difficulty (see Table 4). This full model outperformed the null model and each
single-predictor model, indicating that taking into account the full item competency
demands profile improves the prediction of item difficulty. Around 55% of the variance
in item difficulty could be explained by variation in competency demands; this
represents a 25% increase compared to the explanatory power of a single competency
demand. In line with expectations, the effect of each competency demand was positive,
meaning that item difficulty increases with increased competency demands. The stron-
gest predictors for item difficulty in the full model were Reasoning and argument,
Symbols and formalism and Devising strategies. Surprisingly, upon keeping the other
competency demand ratings constant, a one-level increase in the rated demand for
Representation related to an increase of half a point in item difficulty on the logit scale.
Therefore, while the demand for Representation on its own is not informative (cf. the
single-predictor model), it becomes relevant once it is seen in the context provided by
the other competency demands. Conversely, the contributions of Communication and
Mathematising were no longer significantly different from zero in the context of the full
competency demands profile, due to multicollinearity with these other competencies.

2 AIC, Akaike's information criterion (Akaike, 1973), balances absolute fit to the data with model complexity in terms of the number of parameters; the best model is one that is parsimonious but still fits well.
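The model comparisons can be reproduced along the following lines (continuing the earlier lme4 sketch); the reduction in the item-side random variance relative to the null model is one common way to express explained variance in item difficulty, although the exact formula behind the R(b)² values in Table 4 is not spelled out in this excerpt:

# Likelihood-ratio tests and AIC for the competing models.
anova(m0, m_X1, m_full)
AIC(m0, m_X1, m_full)

# Proportion of item-difficulty variance explained by the competency demands.
var_item_null <- as.data.frame(VarCorr(m0))$vcov[as.data.frame(VarCorr(m0))$grp == "item"]
var_item_full <- as.data.frame(VarCorr(m_full))$vcov[as.data.frame(VarCorr(m_full))$grp == "item"]
1 - var_item_full / var_item_null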
Figures 2 and 3 show two of the PISA mathematics items included in the current
study. Table 2 shows the averaged teacher-rated mathematical competency demands,
the predicted difficulty (based on the rated competency demands and regression
coefficients [1.8 × .5 + 0.8 × .73 + 0.8 × .01 + 1.0 × .49 + 2.2 × .88 + 1.2 × .99 − 2.81 =
2.30]) and the empirical difficulty of the two items. For ‘drip rate’, the predicted and
empirical item difficulties are nearly equal (see Table 2). This indicates that for this
item, the model adequately captures the difficulty of the item based on its rated
competency demands. For ‘sauce’, we observe that the empirical difficulty is lower
than the predicted difficulty (Table 2). This means that the item was less difficult for the
students than what was expected based on the rated competency demands of the item.
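The bracketed computation for 'drip rate' can be checked directly in R, pairing each rating from Table 2 with its coefficient in the order given in the bracketed expression above:

ratings <- c(C = 1.8, DS = 0.8, M = 0.8, R = 1.0, SF = 2.2, RA = 1.2)  # drip rate
coefs   <- c(C = 0.50, DS = 0.73, M = 0.01, R = 0.49, SF = 0.88, RA = 0.99)
intercept <- -2.81
sum(ratings * coefs) + intercept   # 2.30, close to the empirical difficulty of 2.31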
Holistic Models In the third stage, we assessed the explanatory power of the simple
holistic summary scores of the six competency demand ratings. With 21% of explained
variance in item difficulty, using the number of competency demands rated above 1 did
not prove fruitful for providing a practical summary that reflects the item difficulty (see
Table 4). However, this percentage increases to 41% when using the number of
competency demands rated above 0 as predictor. With 53% of explained variance in
item difficulty, summing up the rated demands across the six competencies provides a
sum score that matches the predictive performance of the full model (55%). As
supported by the AIC model comparisons, the sum-across-competencies model is an
equally well-fitting but more parsimonious model than the full model. Therefore, from
a practical viewpoint, the sum of the mathematical competency demands might be a
good and intuitive indicator for teachers to estimate the difficulty of items similar to
those in the PISA assessment. For every point increase in demands, the odds of
correctly responding to the item decrease by a factor of almost 2 (i.e. exp(0.61) = 1.84). In
terms of probability, this means that an average student is predicted to have around a 93,
81, 57 and 29% chance of correctly responding to an item with total competency
demand ratings of 0, 2, 4 and 6, respectively.
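These probabilities can be reconstructed approximately from the reported slope of the sum-score model; the intercept is not quoted in the text and is back-calculated here from the 93% figure at a total demand of 0, so the numbers below only match the reported values up to rounding of the quoted coefficients:

exp(0.61)                  # 1.84: odds of success shrink by this factor per point
b_sum <- 0.61
beta0 <- -qlogis(0.93)     # implied intercept on the difficulty scale, about -2.59
plogis(-(beta0 + b_sum * c(0, 2, 4, 6)))  # about .93, .80, .54, .25 (cf. 93, 81, 57, 29%)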
Table 2 Averaged teacher-rated competency demands, predicted item difficulty and empirical item difficulty
of the two example items ‘drip rate’ (Fig. 2) and ‘sauce’ (Fig. 3)
Item Teacher-rated competency demands Predicted item difficulty Empirical item difficulty
          C    DS   M    R    SF   RA
Drip rate 1.8  0.8  0.8  1.0  2.2  1.2   2.30   2.31
Sauce     0.6  1.0  0.2  0.4  0.6  1.2   0.13  −0.80
C Communication, DS Devising strategies, M Mathematising, R Representation, SF Symbols and formalism, RA Reasoning and argument
Table 3 Descriptive statistics of the five teachers’ competency demand ratings: agreement measures (ICC), means, standard deviations (SD) and correlations between competencies
ICC values are reproduced from Pettersen and Nortvedt (2017). Means, standard deviations and correlations are based on the average ratings of the five teachers and prospective
teachers
Fig. 4 Boxplots of the distribution of the averaged teacher ratings of mathematical competency demands
Null Model The average Norwegian grade 10 student has around a 54% chance of
correctly solving an item of average difficulty on the 2014 national mathematics exam.
It was found that 31 and 33% of the variation in responses was attributable to individual
student differences and individual item differences, respectively. Therefore, the levels
of variation in student ability (σ²θ = 2.85) and item difficulty (σ²β = 3.08) are rather
similar; we hypothesised these differences in item difficulty to be linked to
differences in mathematical competency demands.
Single-Predictor Models Six explanatory item response models, each using one
competency demand as an item predictor, were fitted to examine the extent to which
a single competency demand on its own could explain the variation in item difficulty
(see Table 4). According to both the AIC values (see Table 4) and Chi-square likelihood
ratio tests, all explanatory item response models showed significantly better fit than the
null model, except for the Representation demand model. For the five other single-
predictor models, the differences in the teacher-rated demands for each competency
accounted for 13–28% of the variation in item difficulty. For instance, a 1-point rating
increase in the demand for Symbols and formalism goes together with a 5.40 (i.e.
exp(b1(k))) multiplicative decrease in the odds of correctly responding to the item. In
terms of probability, this would mean that an average student would have an 88%
chance of responding correctly to an item rated at level 0 for Symbols and formal-
ism, whereas this chance would decrease to 58 and 21% for an item rated at levels 1
and 2, respectively.
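A similar back-calculation applies to the Symbols and formalism single-predictor model; the level-0 probability of 88% quoted above pins down the implied intercept, which is not reported directly, so the result is approximate:

b_sf  <- log(5.40)             # about 1.69 on the logit scale
beta0 <- -qlogis(0.88)         # implied intercept, about -1.99
plogis(-(beta0 + b_sf * 0:2))  # about .88, .58, .20 (cf. the reported 88, 58 and 21%)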
Full Model An explanatory item response model with the demand of all six compe-
tencies as item predictors was fitted to explain the variation in item difficulty (see
Table 4). According to the Chi-square likelihood ratio test and AIC, this full model
outperformed both the null model and the single-predictor models. In total, the rated
demands of all six competencies accounted for 48% of the variance in item difficulty.
This is an increase of 20% from the best-performing single competency model.
Table 4 Parameter estimates and model fit for the explanatory item response models of the two assessments
based on the teachers’ averaged ratings of competency demands and student responses
b regression coefficients, SE standard errors, R(b)2 % explained variance in item difficulty, AIC Akaike’s
information criterion
**p < .01, *p < .05
a Fit values for null model
Reasoning and argument and Symbols and formalism were the two prominent compe-
tency demands in the full model that seemed to account for most of the explained
variation in item difficulty. The partial effects of Reasoning and argument and Symbols
and formalism (i.e. keeping the value of the other predictors constant) correspond to a
multiplicative decrease in the odds of correctly responding to the item of around 12 and
6, respectively (i.e. exp(2.46) and exp(1.78)). The partial effects of the other compe-
tencies were not statistically significantly different from zero, and hence, their unique impact
disappeared in the presence of these two strong item predictors.
Holistic Models The explanatory power of three models that represented a more
holistic view of the demand of the items was assessed. The AIC values in Table 4
show that all three models have a worse fit than the full model. Furthermore, the
explanatory powers of the holistic models were rather low compared to that of the full
model, and also lower than that of the best-fitting single-predictor model. Of the three
holistic models, the sum of the demand across all six competencies had the most
explanatory power, accounting for 26% of the variance in item difficulty. The low
predictive power of these models compared to that of the full model is in line with the
observed coefficient imbalance in the full model, where Reasoning and argument and
Symbols and formalism dominate the predictive equation. Combining these latter two
competency demand ratings would constitute a good summary predictor of item demands in
the Norwegian exam; adding up the remaining competency demand ratings would only
add noise to the signal.
Discussion
the holistic perspective proved more successful due to the more equal distribution of
explanatory power across the competency demands.
Both the PISA assessment and the Norwegian exam aimed to measure general
mathematical competence, and the PISA mathematics framework and Norwegian
curriculum are both influenced by the concept of mathematical competence described
in the KOM framework (Niss, 2015; Valenta et al., 2015). Therefore, even though none
of the assessments was designed to explicitly measure the six mathematical competen-
cies in the MEG framework, we would expect that these competencies should be
relevant to the constructs of mathematical competence represented in both PISA and
the exam. The descriptions of the assessment items in the PISA 2012 mathematics
framework and the 2014 Norwegian exam guidelines indicate somewhat different
operationalisations of mathematical competence in the two assessments. While all
PISA items are situated in real-life contexts developed to measure a wide range of
processes and capabilities (OECD, 2013), the Norwegian exam consists of both
contextualised items aimed at measuring the depth and breadth of students' mathematical
competence and non-contextualised items focused on procedural skills (Norwegian
Directorate for Education and Training [Utdanningsdirektoratet], 2014). The results
from the explanatory item response models indicate that the MEG competency frame-
work seems to be better reflected in the PISA items than in the exam items, with more
of the variance in item difficulty being explained and a higher number of competencies
related to item difficulty. The fact that the rated demands for only two of the six
competencies could be related to the difficulty of the exam items raises questions about
the extent to which these items require various cognitive skills and abilities that are
represented in mathematical competence. These results could indicate that the narrow
focus on procedural skills for a rather large proportion of the exam items might be at the
expense of a valid measure of more general mathematical competence as represented
through the six MEG competencies.
The present study provides some promising results for applying an explanatory item
response modelling approach to link mathematical competency demands to item
difficulty. The models show that a substantial amount of the variance in item difficulty
can be explained by the teacher-rated competency demands, indicating the relevance of
the mathematical competencies for solving the assessment items. Nonetheless, about
half of the variance in item difficulty is not explained by the rated competency
demands. A part of this unexplained variance could be related to inaccurate ratings
of the mathematical competency demands and difficulties with distinguishing the
demand for different competencies. More rater training and rater experience and clearly
defined and concrete categories have been recognised as factors that can increase the
discrimination between distinct concepts and dimensions (Feeley, 2002). Thus, further
revisions of the scheme (e.g. clarifying definitions and descriptions) and guiding
material (e.g. more items to exemplify differences between the competencies and levels
of demand) could yield more accurate ratings that better reflect the empirical difficulty
of the items. In addition, a more exhaustive training (for instance by expanding from
one to two full days of training) where more time was spent on applying the scheme,
comparing ratings and discussing differences could improve the raters’ ability to
distinguish between several levels of demand. Nonetheless, it is likely that most of
the unexplained variance is related to item features that are not related to the six
mathematical competencies addressed in our study. Identifying features that influence
the item difficulty is important for understanding what is being measured in assess-
ments and for ensuring that construct-irrelevant item attributes do not threaten the
validity of the interpretations of the test scores (De Boeck et al., 2016; Graf et al.,
2005). Although the variance in item difficulty cannot be fully accounted for by any set
of item features, it is probable that item features exist that are relevant to mathematical
competence but not captured in the MEG scheme. Turner et al. (2013, p. 24) noted that
the mathematical competencies included in the item analysis scheme ‘describe the
essential activities when solving mathematical problems’ and were based on a recon-
figuration of the KOM competencies. One obvious difference between the KOM
framework and the MEG framework is that the Aids and tools competency, which
was not relevant for the early paper-based PISA items, is not a part of the MEG
competencies. In part 2 of the Norwegian exam, the students are required to use digital
tools (e.g. spreadsheets) for solving some of the items (Norwegian Directorate for
Education and Training [Utdanningsdirektoratet], 2014). As this activity is not captured
by the teachers’ ratings, the demands for aids and tools might explain some of the
variance in student performance in the exam data. However, an inspection of the residuals
between the empirical item difficulties and those estimated from the modelled
competency demands shows that exam items requiring the use of digital tools do not appear
to be more divergent than the other items.
and tools demand in the item response models would likely not influence the results to
a large extent.
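A sketch of this residual inspection, assuming an item-level data frame items_exam with the empirical difficulty (beta_emp), the difficulty predicted from the rated competency demands (beta_pred) and a logical indicator digital_tools for items requiring digital tools; all names are illustrative:

items_exam$resid <- items_exam$beta_emp - items_exam$beta_pred

# Compare residuals for items that do and do not require digital tools.
tapply(items_exam$resid, items_exam$digital_tools, summary)
boxplot(resid ~ digital_tools, data = items_exam,
        ylab = "Empirical minus predicted item difficulty (logits)")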
When investigating the empirical separability of cognitive and content domains in
mathematical competencies, Harks, Klieme, Hartig and Leiss (2014) found that the
competencies included in their study were content-specific; the level of demand for the
competencies depended on the mathematical content in which they appear. One
characteristic of the KOM and MEG competency frameworks that distinguishes them
from many other mathematical frameworks is the absence of content domains (e.g.
algebra, geometry and measurement). Previous studies that have investigated features
that influence the difficulty of mathematics items are mostly situated within a certain
topic area. For instance, both Enright et al. (2002) and Koedinger and Nathan (2004)
investigated the factors that influenced the difficulty of word problems related to
quantity. Although the competencies in the MEG framework are defined to overarch
mathematical topics and content, their activation could be more or less demanding
within certain domains or areas. It is therefore possible that including information about
content or context as item attributes could account for some of the unexplained
variance.
Wilson, De Boeck and Carstensen (2008) distinguished between planned and
unplanned variation in item properties, where the former refers to items that are
developed to systematically vary with regard to the properties of interest, and the latter
is based on post hoc analysis of items in an extant test. Studies using planned variation
have shown that a large proportion of the variance in item difficulty can be explained
by the varied item features (Daniel & Embretson, 2010; Enright et al., 2002). In the
case of PISA and the Norwegian exam, the variation in mathematical competency
demands is planned on the conceptual level, whereas the implementation has been
much more ad hoc with a more indirect impact on the development of the assessment
items. Therefore, a next step to follow up on this research is to replicate the study with a
more systematically designed assessment that follows through the competency frame-
work from concept to implemented items. The latter might prove to be the real
challenge. One might wonder whether it is practically feasible to design item sets that
tap into each combination of both competencies and levels of demand or that isolate the
demand of different competencies.
Concluding Remarks
Zlatkin-Troitschanskaia, Shavelson and Kuhn (2015) stress that the conceptual model
of competency should dictate the nature of the psychometric models and not the other
way around. Although only a subsample of the six mathematical competencies is
identified empirically and seems to add to the explanatory power in the item response
models, reducing the theoretical competency framework to include only a subsample of
competencies would be an improper interpretation and use of empirical data. For
instance, the importance of communication and representations is recognised in math-
ematics frameworks, curricula and assessments around the world (Niss et al., 2016),
and removing these would lead to a limited view of mathematical competence. Rather,
empirical data from psychometric modelling should be used to inform and further
develop theoretical models of mathematical competence, as well as to improve assess-
ments and operationalisation of competencies in assessment items. From this perspec-
tive, we call not only for the continued conceptual development of the competency
framework but also for a more systematic development of assessment items that are
intended to tap into these competencies. The latter area has been neglected for too long
and is vital to ensure valid measures of mathematical competencies that are aligned
with the present goals of mathematics education.
Acknowledgements The authors would like to thank Ross Turner for his support, feedback and contribution
of material during this study. Further, we would also like to thank the teachers and prospective teachers for
their contribution, the Norwegian PISA Group for allowing access to the PISA material and the Norwegian
Directorate for Education and Training for access to the national exam material.
References
Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In B. N. Petrov
& F. Csáki (Eds.), Second international symposium on information theory (pp. 267–281). Budapest,
Hungary: Akademiai Kiado.
Bates, D., Mächler, M., Bolker, B., & Walker, S. (2015). Fitting linear mixed-effects models using lme4.
Journal of Statistical Software, 67, 1–48. https://doi.org/10.18637/jss.v067.i01.
Blömeke, S., Gustafsson, J.-E., & Shavelson, R. J. (2015). Beyond dichotomies: Viewing competence as a
continuum. Zeitschrift für Psychologie, 223(1), 3–13.
Boesen, J., Helenius, O., Bergqvist, E., Bergqvist, T., Lithner, J., Palm, T., & Palmberg, B. (2014). Developing
mathematical competence: From the intended to the enacted curriculum. The Journal of Mathematical
Behavior, 33, 72–87.
Boesen, J., Lithner, J. & Palm, T. (2016). Assessing mathematical competencies: An analysis of Swedish
national mathematics tests. Scandinavian Journal of Educational Research, 1–16. https://doi.org/10.1080
/00313831.2016.1212256.
Daniel, R. C., & Embretson, S. E. (2010). Designing cognitive complexity in mathematical problem-solving
items. Applied Psychological Measurement, 34(5), 348–364.
De Boeck, P., Cho, S. J., & Wilson, M. (2016). Explanatory item response models. In A. A. Rupp & J. P.
Leighton (Eds.), The handbook of cognition and assessment: Frameworks, methodologies, and applica-
tions (pp. 249–268). Hoboken: Wiley.
Duval, R. (2006). A cognitive analysis of problems of comprehension in a learning of mathematics.
Educational Studies in Mathematics, 61(1), 103–131.
Elia, I., Panaoura, A., Eracleous, A., & Gagatsis, A. (2007). Relations between secondary pupils’ conceptions
about functions and problem solving in different representations. International Journal of Science and
Mathematics Education, 5(3), 533–556.
Embretson, S. E., & Daniel, R. C. (2008). Understanding and quantifying cognitive complexity level in
mathematical problem solving items. Psychology Science, 50(3), 328–344.
Embretson, S. E., & Gorin, J. (2001). Improving construct validity with cognitive psychology principles.
Journal of Educational Measurement, 38(4), 343–368.
Enright, M. K., Morley, M., & Sheehan, K. M. (2002). Items by design: The impact of systematic feature
variation on item statistical characteristics. Applied Measurement in Education, 15(1), 49–74.
Feeley, T. H. (2002). Comment on halo effects in rating and evaluation research. Human Communication
Research, 28(4), 578–586. https://doi.org/10.1111/j.1468-2958.2002.tb00825.x.
Gorin, J. S., & Embretson, S. E. (2006). Item difficulty modeling of paragraph comprehension items. Applied
Psychological Measurement, 30(5), 394–411.
Graf, E. A., Peterson, S., Steffen, M., & Lawless, R. (2005). Psychometric and cognitive analysis as a basis
for the design and revision of quantitative item models (No. RR-05-25). Princeton: Educational Testing
Service.
Harks, B., Klieme, E., Hartig, J., & Leiss, D. (2014). Separating cognitive and content domains in mathe-
matical competence. Educational Assessment, 19(4), 243–266.
Janssen, R., Schepers, J., & Peres, D. (2004). Models with item and item group predictors. In P. De Boeck &
M. Wilson (Eds.), Explanatory item response models (pp. 189–212). New York: Springer.
Kilpatrick, J. (2014). Competency frameworks in mathematics education. In S. Lerman (Ed.), Encyclopedia of
mathematics education (pp. 85–87). Dordrecht, The Netherlands: Springer.
Koedinger, K. R., & Nathan, M. J. (2004). The real story behind story problems: Effects of representations on
quantitative reasoning. The Journal of the Learning Sciences, 13(2), 129–164.
Koeppen, K., Hartig, J., Klieme, E., & Leutner, D. (2008). Current issues in competence modeling and
assessment. Zeitschrift für Psychologie, 216(2), 61–73.
Lane, S. (2004). Validity of high-stakes assessment: Are students engaged in complex thinking? Educational
Measurement: Issues and Practice, 23(3), 6–14.
Messick, S. (1995). Validity of psychological assessment: Validation of inferences from persons’ responses
and performances as scientific inquiry into score meaning. American Psychologist, 50(9), 741–749.
National Council of Teachers of Mathematics (NCTM). (2000). Principles and standards for school mathe-
matics. Reston: NCTM.
Niss, M. (2007). Reflections on the state of and trends in research on mathematics teaching and learning. In F.
K. J. Lester (Ed.), Second handbook of research on mathematics teaching and learning (pp. 1293–1312).
Charlotte, NC: Information Age.
Niss, M. (2015). Mathematical competencies and PISA. In K. Stacey & R. Turner (Eds.), Assessing
mathematical literacy: The PISA experience (pp. 35–55). Heidelberg: Springer.
Niss, M., Bruder, R., Planas, N., Turner, R., & Villa-Ochoa, J. A. (2016). Survey team on: Conceptualisation
of the role of competencies, knowing and knowledge in mathematics education research. ZDM, 48(5),
611–632.
Niss, M., & Højgaard, T. (Eds.). (2011). Competencies and mathematical learning. Denmark: Roskilde
University.
Norwegian Directorate for Education and Training [Utdanningsdirektoratet]. (2014). Eksamensveiledning - om
vurdering av eksamensbesvarelser. MAT0010 Matematikk. Sentralt gitt skriftlig eksamen. Grunnskole
[Manual - to be used to assess exam papers. MAT0010 Mathematics. National written exam, end of
compulsory education]. Oslo: Utdanningsdirektoratet.
Organization for Economic Co-operation and Development (OECD). (2013). PISA 2012 Assessment and
analytical framework: Mathematics, reading, science, problem solving and financial literacy. Paris:
OECD Publishing. https://doi.org/10.1787/9789264190511-en.
Organization for Economic Co-operation and Development (OECD). (2014). PISA 2012 technical report.
Paris: OECD Publishing. Retrieved from https://www.oecd.org/pisa/pisaproducts/PISA-2012-technical-
report-final.pdf
Pettersen, A., & Nortvedt, G. A. (2017). Identifying competency demands in mathematical tasks: recognising
what matters. International Journal of Science and Mathematics Education. https://doi.org/10.1007
/s10763-017-9807-5.
R Core Team. (2016). R: A language and environment for statistical computing. Vienna, Austria: R
Foundation for Statistical Computing. Retrieved from http://www.R-project.org
Stylianou, D. A. (2011). An examination of middle school students’ representation practices in mathematical
problem solving through the lens of expert work: Towards an organizing scheme. Educational Studies in
Mathematics, 76(3), 265–280.
Turner, R., Blum, W., & Niss, M. (2015). Using competencies to explain mathematical item demand: A work
in progress. In K. Stacey & R. Turner (Eds.), Assessing mathematical literacy: The PISA experience (pp.
85–115). New York: Springer.
Turner, R., Dossey, J., Blum, W., & Niss, M. (2013). Using mathematical competencies to predict item
difficulty in PISA: A MEG study. In M. Prenzel, M. Kobarg, K. Schöps, & S. Rönnebeck (Eds.),
Research on PISA (pp. 23–37). New York: Springer.
Valenta, A., Nosrati, M., & Wæge, K. (2015). Skisse av den «ideelle læreplan i matematikk» [Draft of the
«ideal curriculum in mathematics»]. Trondheim: Nasjonalt senter for matematikk i opplæringen.
Retrieved from https://nettsteder.regjeringen.no/fremtidensskole/files/2014/05/Skisse-av-den-
ideellel%C3%A6replanen-i-matematikk.pdf.
Wilson, M., De Boeck, P., & Carstensen, C. H. (2008). Explanatory item response models: A brief
introduction. In E. Klieme & D. Leutner (Eds.), Assessment of competencies in educational contexts:
State of the art and future prospects (pp. 91–120). Göttingen: Hogrefe & Huber.
Zlatkin-Troitschanskaia, O., Shavelson, R. J., & Kuhn, C. (2015). The international state of research on
measurement of competency in higher education. Studies in Higher Education, 40(3), 393–411.